Polar Codes: Speed of polarization 
and polynomial gap to capacity 

Venkatesan Guruswami    Patrick Xia

Computer Science Department

Carnegie Mellon University

Pittsburgh, PA 15213


Abstract 



We prove that, for all binary-input symmetric memoryless channels, polar codes enable reliable
communication at rates within $\epsilon > 0$ of the Shannon capacity with a block length, construction complexity,
and decoding complexity all bounded by a polynomial in $1/\epsilon$. Polar coding gives the first known explicit
construction with rigorous proofs of all these properties.

We give an elementary proof of the capacity achieving property of polar codes that does not rely on
the martingale convergence theorem. As a result, we are able to explicitly show that polar codes can have
block length (and consequently also encoding and decoding complexity) that is bounded by a polynomial
in the gap to capacity. The generator matrix of such polar codes can be constructed in polynomial time
using merging of channel output symbols to reduce the alphabet size of the channels seen at the decoder.

1 Introduction

In this work, we establish that Arıkan's celebrated polar codes [2] have the desirable property of fast
convergence to Shannon capacity. Specifically, we prove that polar codes can operate at rates within $\epsilon > 0$
of the Shannon capacity of binary-input memoryless output-symmetric (BIS) channels with a block length
$N = N(\epsilon)$ that grows only polynomially in $1/\epsilon$. Further, a generator matrix of such a code can be
deterministically constructed in time polynomial in the block length $N$. For decoding, Arıkan's successive
cancellation decoder has polynomial (in fact $O(N \log N)$) complexity.

Thus, the delay and construction/decoding complexity of polar codes can all be polynomially bounded as 
a function of the gap to capacity. This provides a complexity-theoretic backing for the statement "polar codes 
are the first constructive capacity achieving codes," common in the recent coding literature. As explained 
below, these attributes together distinguish polar codes from the Forney/Justesen style concatenated code
constructions for achieving capacity. 

Our analysis of polar codes avoids the use of the martingale convergence theorem; this is instrumental
in our polynomial convergence bounds and as a side benefit makes the proof elementary and self-contained.

1.1 Context

Shannon's noisy channel coding theorem implies that for every memoryless channel $W$ with binary inputs
and a finite output alphabet, there is a capacity $I(W) \ge 0$ and constants $a_W < \infty$ and $b_W > 0$ such that
the following holds: For all $\epsilon > 0$ and integers $N \ge a_W/\epsilon^2$, there exists a binary code $C \subseteq \{0,1\}^N$
of rate at least $I(W) - \epsilon$ which enables reliable communication on the channel $W$ with probability of
miscommunication at most $2^{-b_W \epsilon^2 N}$. A proof implying these quantitative bounds is implicit in Wolfowitz's
proof of Shannon's theorem [23].

This remarkable theorem showed that a constant factor redundancy was sufficient to achieve arbitrarily
small probability of miscommunication, provided we tolerate a "delay" of processing $N$ channel outputs at
a time for large enough block length $N$. Further, together with a converse theorem, it precisely characterized
the minimum redundancy factor (namely, $1/I(W)$) needed to achieve such a guarantee. It is also known
that a block length of $N \ge \Omega(1/\epsilon^2)$ is required to operate within $\epsilon$ of capacity and even a constant, say 0.1,
probability of miscommunication; in fact, a very precise statement that even pinned down the constant in
the $\Omega(\cdot)$ notation was obtained by Strassen [21].

As Shannon's theorem is based on random coding and is non-constructive, one of the principal theoretical
challenges is to make it constructive. More precisely, the goal is to give an explicit (i.e., constructible in
deterministic $\mathrm{poly}(N)$ time) description of the encoding function of the code, and a polynomial time
error-correction algorithm for decoding the correct transmitted codeword with high probability (over the noise of
the channel). Further, it is important to achieve this with small block length $N$ as that corresponds to the
delay at the receiver before the message bits can be recovered.

For simplicity let us for now consider the binary symmetric channel (BSC) with crossover probability $p$,
$0 < p < 1/2$, denoted $\mathrm{BSC}_p$ (our results hold for any BIS channel). Recall that $\mathrm{BSC}_p$ flips each input bit
independently with probability $p$, and leaves it unchanged with probability $1 - p$. The Shannon capacity of
$\mathrm{BSC}_p$ is $1 - h(p)$, where $h(x) = -x \log_2 x - (1 - x) \log_2(1 - x)$ is the binary entropy function. For the
BSC, the capacity can be achieved by binary linear codes.
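
As a small numerical illustration of the capacity formula $1 - h(p)$, the following Python sketch evaluates the binary entropy function and the resulting BSC capacity. The function names `binary_entropy` and `bsc_capacity` are chosen here for illustration and do not come from the paper.

```python
import math


def binary_entropy(x: float) -> float:
    """h(x) = -x log2(x) - (1 - x) log2(1 - x), with h(0) = h(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def bsc_capacity(p: float) -> float:
    """Shannon capacity 1 - h(p) of the binary symmetric channel BSC_p."""
    return 1.0 - binary_entropy(p)


# Print the capacity for a few crossover probabilities.
for p in (0.01, 0.11, 0.25, 0.5):
    print(f"p = {p:4.2f}  capacity = {bsc_capacity(p):.4f}")
```

For instance, a crossover probability near 0.11 gives a capacity close to 1/2, so a rate within $\epsilon$ of capacity there means a rate of roughly $1/2 - \epsilon$.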

One simple and classic approach to construct capacity-achieving codes is via Forney's concatenated
codes [9]. We briefly recall this approach (see, for instance, [11, Sec. 3] for more details). Suppose we
desire codes of rate $1 - h(p) - \epsilon$ for communication on $\mathrm{BSC}_p$. The idea is to take as an outer code any
binary linear code $C_{\mathrm{out}} \subseteq \{0,1\}^{n_0}$ of rate $1 - \epsilon/2$ that can correct a fraction $\gamma(\epsilon) > 0$ of worst-case errors.
Then, each block of $b = \Theta(\frac{1}{\epsilon^2} \log(1/\gamma))$ bits of the outer codeword is further encoded by an inner code of
rate within $\epsilon/2$ of Shannon capacity (i.e., rate at least $1 - h(p) - \epsilon/2$). This inner code is constructed by brute
force in time $\exp(O(b))$. By decoding the inner blocks by finding the nearest codeword in $\exp(O(b))$ time,
and then correcting up to $\gamma(\epsilon) n_0$ errors at the outer level, one can achieve exponentially small decoding
error probability. However the decoding complexity grows like $n_0 \exp(O(b))$. Thus both the construction
and decoding complexity have an exponential dependence on $1/\epsilon$. In conclusion, this method allows one to
obtain codes within $\epsilon$ of capacity with a block length polynomially large in $1/\epsilon$. However, the construction
and decoding complexity grow exponentially in $1/\epsilon$, which is undesirable (footnote 1).

1.2 Our result: polynomial convergence to capacity of polar codes 

In this work, we prove that Arıkan's remarkable polar codes allow us to approach capacity within a gap
$\epsilon > 0$ with delay (block length) and complexity both depending polynomially on $1/\epsilon$. Polar codes are the
first known construction with this property (footnote 2).

Footnote 1: One can avoid the brute force search for a good inner code by using a small ensemble of
capacity-achieving codes in a Justesen-style construction [15]. But this will require taking the outer code
length $n_0 > \exp(1/\epsilon^2)$, causing a large delay.

Below is a formal statement of the main result, stated for BIS channels. For general, non-symmetric
channels, the same claim holds for achieving the symmetric capacity, which is the best rate achievable with 
the uniform input bit distribution. 

Theorem 1. There is an absolute constant $\mu < \infty$ such that the following holds. Let $W$ be a binary-input
memoryless output-symmetric channel with capacity $I(W)$. Then there exists $a_W < \infty$ such that for all
$\epsilon > 0$ and all powers of two $N \ge a_W (1/\epsilon)^{\mu}$, there is a deterministic $\mathrm{poly}(N)$ time construction of a binary
linear code of block length $N$ and rate at least $I(W) - \epsilon$ and a deterministic $N \cdot \mathrm{poly}(\log N)$ time decoding
algorithm for the code with block error probability at most $2^{-N^{0.49}}$ for communication over $W$.



Remarks: 

1. Using our results about polar codes, we can also construct codes of rate $I(W) - \epsilon$ with $2^{-\Omega_\epsilon(N)}$
block error probability (similar to Shannon's theorem) with similar claims about the construction and
decoding complexity. The idea is to concatenate an outer code that can correct a small fraction of
worst-case errors with a capacity-achieving polar code of dimension $\mathrm{poly}(1/\epsilon)$ as the inner code.
A similar idea with outer Reed-Solomon codes yielding $2^{-\Omega(N/\mathrm{poly}(\log N))}$ block error probability is
described in [6].

2. The construction time in Theorem 1 can be made $\mathrm{poly}(1/\epsilon) + O(N \log N)$. As our main focus is on
the finite-length behavior when $N$ is also $\mathrm{poly}(1/\epsilon)$, we are content with stating the $\mathrm{poly}(N)$ claim
above.

Showing that polar codes have a gap to capacity that is polynomially small in $1/N$ is our principal
contribution. The decoding algorithm remains the same successive cancellation decoder of Arıkan [2]. The
proof of efficient constructibility follows the approach, originally due to Tal and Vardy [22], of approximating
the channels corresponding to different input bits seen at the decoder by a degraded version with a
smaller output alphabet. The approximation error of this process and some of its variants were analyzed in
[19]. We consider and analyze a somewhat simpler degrading process. One slight subtlety here is that we
can only estimate the channel's Bhattacharyya parameter within error that is polynomial in $1/N$ in $\mathrm{poly}(N)$
time, which will limit the analysis to an inverse polynomial block error probability. To get a block error
probability of $2^{-N^{0.49}}$ we use a two step construction method that follows our analysis of the polarization
process. As a bonus, this gives the better construction time alluded to in the second remark above.

Prior to our work, it was known that the block error probability of successive cancellation decoding of
polar codes is bounded by $2^{-N^{0.49}}$ for rate approaching $I(W)$ in the limit of $N \to \infty$ [5]. However, the
underlying analysis found in [5], which depended on the martingale convergence theorems, did not offer
any bounds on the finite-length convergence to capacity, i.e., the block length $N$ required for the rate to
be within $\epsilon$ of the capacity $I(W)$. To quote from the introduction of the recent breakthrough on spatially
coupled LDPC codes [18]:

"There are perhaps only two areas in which polar codes could be further improved. First, for
polar codes the convergence of their performance to the asymptotic limit is slow. Currently no
rigorous statements regarding this convergence for the general case are known. But "calculations"
suggest that, for a fixed desired error probability, the required block length scales like
$1/\delta^{\mu}$, where $\delta$ is the additive gap to capacity and where $\mu$ depends on the channel and has a
value around 4." (footnote 3)

Footnote 2: Spatially coupled LDPC codes were also recently shown to achieve the capacity of general BIS
channels [18]. This construction gives a random code ensemble as opposed to a specific code, and as far as
we know, rigorous bounds on the code length as a function of gap to capacity are not available.

The above-mentioned heuristic calculations are based on "scaling laws" and presented in [17]. We will
return to the topic of scaling laws in Section 1.4 on related work.

We note that upper bounds on the block length $N$ as a function of gap $\epsilon$ to capacity are crucial, as
without those we cannot estimate the complexity of communicating at rates within $\epsilon$ of capacity. Knowing
that the asymptotic complexity is $O(N \log N)$ for large $N$ by itself is insufficient (for example, to claim that
polar codes are better than concatenated codes) as we do not know how large $N$ has to be! While an explicit
value of $\mu$ in Theorem 1 can be calculated, it will be rather large, and obtaining better bounds on $\mu$, perhaps
closer to the empirically suggested bound of $\approx 4$, is an interesting open problem (footnote 4).

1.3 Techniques 

Let us first briefly discuss the concept of polarization in Arıkan's work, and then turn to aspects of our work.
More formal background on Arıkan's construction of polar codes appears in Section 3 (with slightly different
notation that is more conventional in the polar coding literature). A good, easy to read, reference on
polar codes is the recent survey by Şaşoğlu [7].

Fix $W$ to be an arbitrary symmetric channel. If we have a capacity-achieving binary linear code $C$ of
block length $N$ for $W$, then it is not hard to see that by padding the generator matrix of $C$ one can get an
$N \times N$ invertible matrix $A_N$ with the following polarization property. Let $u \in \{0,1\}^N$ be a uniformly
random (column) vector. Given the output $y$ of $W$ when the $N$ bits $x = A_N u$ are transmitted on it, for a
$1 - o(1)$ fraction of bits $u_i$, its conditional entropy given $y$ and the previous bits $u_1, \ldots, u_{i-1}$ is either close
to 0 (i.e., that bit can be determined with good probability) or close to 1 (i.e., that bit remains random). Since
the conditional entropies of $u$ given $y$ and $x$ given $y$ are equal to each other, and the latter is $\approx (1 - I(W))N$,
the fraction of bits $u_i$ for which the conditional entropy given $y$ and the previous bits $u_1, \ldots, u_{i-1}$ is $\approx 0$
(resp. $\approx 1$) is $\approx I(W)$ (resp. $\approx 1 - I(W)$).

Arıkan gave a recursive construction of such a polarizing matrix $A_N$ for $N = 2^n$: $A_N = G_2^{\otimes n} B_n$ where
$G_2 = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ and $B_n$ is a permutation matrix (for the bit-reversal permutation). In addition, he showed that
the recursive structure of the matrix implied the existence of an efficiently decodable capacity-achieving
code. The construction of this code amounts to figuring out which input bit positions have conditional
entropy $\approx 0$, and which don't (the message bits $u_i$ corresponding to the latter positions are "frozen" to 0).

The proof that $A_N$ has the above polarization property proceeds by working with the Bhattacharyya
parameters $Z_n(i) \in [0,1]$ associated with decoding $u_i$ from $y$ and $u_1, \ldots, u_{i-1}$. This quantity is the Hellinger
affinity between the output distributions when $u_i = 0$ and $u_i = 1$, and is a better quantity than conditional
entropy to work with. The values of the Bhattacharyya parameter of the $2^n$ bit positions at the $n$'th level can
be viewed as a random variable $Z_n$ (induced by the uniform distribution on the $2^n$ positions). The simple
recursive construction of $A_N$ enabled Arıkan to prove that the sequence of random variables $Z_0, Z_1, Z_2, \ldots$
form a supermartingale. In particular, $Z_{n+1}$ equals $Z_n^2$ with probability 1/2 and is at most $2Z_n - Z_n^2$ with
probability 1/2 (footnote 5).

Footnote 3: The second aspect concerns universality: the design of polar codes depends on the channel being
used, and the same code may not achieve capacity over a non-trivial class of channels.

Footnote 4: While we were completing the writeup of this paper and circulating a draft, we learned about a
recent independently-derived result in [12] stating that $\mu \approx 6$ would suffice for block error probabilities
bounded by an inverse polynomial. Our analysis primarily focuses on the $2^{-N^{0.49}}$ block error probability result.

One can think of the evolution of the Bhattacharyya parameter as a stochastic process on the infinite
binary tree, where in each step we branch left or right with probability 1/2. The polarization property is
then established by invoking the martingale convergence theorem for supermartingales. The martingale
convergence theorem implies that $\lim_{n \to \infty} |Z_{n+1} - Z_n| = 0$, which in this specific case also implies
$\lim_{n \to \infty} Z_n(1 - Z_n) = 0$ or in other words polarization of $Z_n$ to 0 or 1 for $n \to \infty$. However, it does
not yield any effective bounds on the speed at which polarization occurs. In particular, it does not say
how large $n$ must be as a function of $\epsilon$ before $\mathbb{E}[Z_n(1 - Z_n)] \le \epsilon$; such a bound is necessary, though not
sufficient, to get codes of block length $2^n$ with rate within $\epsilon$ of capacity.

Starting at the root, the expected number of steps before which $Z_n(1 - Z_n) \le \epsilon$ for the first time can
be as large as $\Omega(1/\epsilon)$, even for the binary erasure channel. Note that we need a bound of $n \le O(\log(1/\epsilon))$
to have any hope of obtaining a polynomial dependence of the block length on the gap to capacity. Thus
this situation demands that with high probability $O(\log(1/\epsilon))$ steps suffice (for $Z_n(1 - Z_n)$ to fall below $\epsilon$)
even though the expected number of steps for this to happen is $\Omega(1/\epsilon)$.

Rather than trying to control the ill-behaved random variable that counts the number of steps needed for
$Z_n(1 - Z_n)$ to drop below $\epsilon$, we simply prove that $\mathbb{E}[Z_n(1 - Z_n)]$ decreases by a constant factor in each step.
This immediately implies that $\mathbb{E}[Z_n(1 - Z_n)] \le \rho^n$ for some $\rho < 1$, and thus $n = O(\log(1/\epsilon))$ suffices to
ensure $\mathbb{E}[Z_n(1 - Z_n)] \le \epsilon$ (we call this rough polarization).
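
To make rough polarization concrete, the following Python sketch tracks $\mathbb{E}[Z_n(1 - Z_n)]$ exactly for the binary erasure channel, where the two branches of the process are exactly $Z \mapsto 2Z - Z^2$ and $Z \mapsto Z^2$. This is only an illustration for the erasure channel, not the general argument; the function names and the choice of erasure probability 0.5 are assumptions made here.

```python
def bec_next_level(z_values):
    """One polarization step for the BEC: each Z splits into 2Z - Z^2 and Z^2."""
    out = []
    for z in z_values:
        out.append(2 * z - z * z)  # "minus" branch (worse channel)
        out.append(z * z)          # "plus" branch (better channel)
    return out


def mean_zz(z_values):
    """Average of Z(1 - Z) over all positions at the current level."""
    return sum(z * (1 - z) for z in z_values) / len(z_values)


def rough_polarization_trace(erasure_prob, levels):
    """Return [E[Z_0(1-Z_0)], ..., E[Z_levels(1-Z_levels)]] for the BEC."""
    z_values = [erasure_prob]
    trace = [mean_zz(z_values)]
    for _ in range(levels):
        z_values = bec_next_level(z_values)
        trace.append(mean_zz(z_values))
    return trace


# Print the expectation level by level for a BEC with erasure probability 0.5.
for n, value in enumerate(rough_polarization_trace(0.5, 12)):
    print(n, value)
```

The printed values shrink with $n$, which is the behavior the rough polarization bound quantifies; the sketch enumerates all $2^n$ positions, so it is only practical for small $n$.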

The above expectation bound is itself, however, not enough to prove Theorem 1. What one needs is
fine polarization, where the smallest $\approx I(W)N$ values among $Z_n(i)$ all add up to a quantity that tends to
0 for large $N$ (in fact, this sum should be at most $2^{-N^{0.49}}$ if we want the block error probability claimed in
Theorem 1). We establish this by using Chernoff-bound arguments (similar to [5]) to bootstrap the rough
polarization to a fine polarization.

Our analysis is elementary and self-contained, and does not use the martingale convergence theorem. 
The ingredients in our analysis were all present explicitly or implicitly in various previous works. However, 
it appears that their combination to imply a polynomial convergence to capacity has not been observed 
before, as evidenced by the explicit mention of this as an open problem in the literature, e.g., [16, Section
6.6], [18, Section Ia], [20, Section 1.3], and [22, Section I] (see the discussion following Corollary 2).



1.4 Related work 

The simplicity and elegance of the construction of polar codes, and their wide applicability to a range of 
classic information theory problems, have made them a popular choice in the recent literature. Here we only 
briefly discuss aspects close to our focus on the speed of polarization. 

Starting with Arıkan's original paper, the "rate of polarization" has been studied in several works. However,
this refers to something different than our focus; this is why we deliberately use the term "speed of
polarization" to refer to the question of how large $n$ needs to be before, say, $Z_n$ is in the range $(\epsilon, 1 - \epsilon)$ with
probability $\epsilon$. The rate of polarization refers to pinpointing a function $T$ with $T(n) \to 0$ for large $n$ such
that $\lim_{n \to \infty} \Pr[Z_n \le T(n)] = I(W)$. Arıkan proved that one can take $T(n) = O(2^{-5n/4})$, and later
Arıkan and Telatar established that one can take $T(n) = 2^{-2^{\beta n}}$ for any $\beta < 1/2$ [5]. Further they proved
that for $\gamma > 1/2$, $\lim_{n \to \infty} \Pr[Z_n \le 2^{-2^{\gamma n}}] = 0$. This determined the rate at which the Bhattacharyya
parameters of the "noiseless" channels polarize to 0 in the limit of large $n$. More fine grained bounds on
this asymptotic rate of polarization as a function of the code rate were obtained in [13].

Footnote 5: For the special case of the binary erasure channel, the Bhattacharyya parameters simply equal the
probability that the bit is unknown. In this case, the upper bound of $2Z_n - Z_n^2$ becomes an exact bound, and
the $Z_i$'s form a martingale.

For our purpose, to get a finite-length statement about the performance of polar codes, we need to
understand the speed at which $\Pr[Z_n \le T(n)]$ approaches the limit $I(W)$ as $n$ grows (any function $T$ with
$T(n) = o(1/2^n)$ will do, though we get the right $2^{-2^{0.49 n}}$ type decay).

Restated in our terminology, in [10] the authors prove the following "negative result" concerning gap
to capacity: for polar coding with successive cancellation (SC) decoding to have vanishing decoding error
probability at rates within $\epsilon$ of capacity, the block length has to be at least $(1/\epsilon)^{3.553}$. (A slight caveat is
that this uses the sum of the error probabilities of the well-polarized channels as a proxy for the block error
probability, whereas in fact this sum is only an upper bound on the decoding error probability of the SC
decoder.)

Also related to the gap to capacity question is the work on "scaling laws," which is inspired by the
behavior of systems undergoing a phase transition in statistical physics. In coding theory, scaling laws were
suggested and studied in the context of iterative decoding of LDPC codes in [1]. In that context, for a
channel with capacity $C$, the scaling law posits the existence of an exponent $\mu$ such that the block error
probability $P_e(N, R)$ as a function of block length $N$ and rate $R$ tends in the limit of $N \to \infty$ while fixing
$N^{1/\mu}(C - R) = x$, to $f(x)$ where $f$ is some function that decreases smoothly from 1 to 0 as its argument
changes from $-\infty$ to $+\infty$. Coming back to polar codes, in [17], the authors make a Scaling Assumption
that the probability $Q_n(x)$ that $Z_n$ exceeds $x$ is such that $\lim_{n \to \infty} N^{1/\mu} Q_n(x)$ exists and equals a function
$Q(x)$. Under this assumption, they use simulations to numerically estimate $\mu \approx 3.627$ for the BEC. Using
the small $x$ asymptotics of $Q(x)$ suggested by the numerical data, they predict an $\approx (1/\epsilon)^{\mu}$ upper bound on
the block length as a function of the gap $\epsilon$ to capacity for the BEC. For general channels, under the heuristic
assumption that the densities of log-likelihood ratios behave like Gaussians, an exponent of $\mu \approx 4.001$ is
suggested for the Scaling Assumption. However, to the best of our knowledge, it does not appear that one
can get a rigorous upper bound on block length $N$ as a function of the gap to capacity via these methods.

2 Preliminaries 

We will work over a binary input alphabet $B = \{0, 1\}$. Let $W : B \to \mathcal{Y}$ be a binary-input memoryless
symmetric channel with finite output alphabet $\mathcal{Y}$ and transition probabilities $\{W(y|x) : x \in B, y \in \mathcal{Y}\}$. A
binary-input channel is symmetric when the two rows of the transition probability matrix are permutations
of each other; i.e., there exists a bijective mapping $\sigma : \mathcal{Y} \to \mathcal{Y}$ where $\sigma = \sigma^{-1}$ and $W(y|0) = W(\sigma(y)|1)$
for all $y$. Both the binary erasure and binary symmetric channels are examples of symmetric channels.

Let X represent a uniformly distributed binary random variable, and let Y represent the result of sending 
X through the channel W . 

The entropy of the channel $W$, denoted $H(W)$, is defined as the entropy of $X$, the input, given the output
$Y$, i.e., $H(W) = H(X|Y)$. It represents how much uncertainty there is in the input of the channel given the
output of the channel. The mutual information of $W$, sometimes known as the capacity, and denoted $I(W)$,
is defined as the mutual information between $X$ and $Y$ when the input distribution $X$ is uniform:

$$I(W) = I(X; Y) = H(X) - H(X|Y) = 1 - H(X|Y) = 1 - H(W).$$

We have $0 \le I(W) \le 1$, with a larger value meaning a less noisy channel. As the mutual information
expression is difficult to work with directly, we will often refer to the Bhattacharyya parameter of $W$ as a
proxy for the quality of the channel:

$$Z(W) = \sum_{y \in \mathcal{Y}} \sqrt{W(y|0) W(y|1)}.$$

This quantity is a natural one to capture the similarity between the channel outputs when the input is 0 and
1: $Z(W)$ is simply the dot product between the unit vectors obtained by taking the square root of the output
distributions under input 0 and 1 (which is also called the Hellinger affinity between these distributions).

Intuitively, the Bhattacharyya parameter $Z(W)$ should be near 0 when $H(W)$ is near 0 (meaning that it
is easy to ascertain the input of a channel given the output), and conversely, $Z(W)$ is near 1 when $H(W)$ is
near 1. This intuition is quantified by the following expression (where the upper bound is from [16, Lemma
1.5] and the lower bound is from O): 

$$Z(W)^2 \le H(W) \le Z(W). \tag{1}$$

Given a single output $y \in \mathcal{Y}$ from a channel $W$, we would like to be able to map it back to $X$, the input
to the channel. The most obvious way to do this is by using the maximum-likelihood decoder:

$$\hat{X} = \arg\max_{x \in B} \Pr(x \mid y) = \arg\max_{x \in B} W(y \mid x)$$

where a decoding error is declared if there is a tie. Thus, the probability of error for a uniform input bit
under maximum likelihood decoding is

$$P_e(W) = \Pr(\hat{X} \ne X) = \frac{1}{2} \sum_{x \in B} \sum_{y \in \mathcal{Y}} W(y \mid x)\, \mathbb{1}_{W(y|x) \le W(y|x \oplus 1)}$$

where $\mathbb{1}_x$ denotes the indicator function of $x$. Directly from this expression, we can conclude

$$P_e(W) \le Z(W) \tag{2}$$

since $\mathbb{1}_{W(y|x) \le W(y|x \oplus 1)} \le \sqrt{W(y|x \oplus 1)} / \sqrt{W(y|x)}$, and the channel is symmetric (so the sum over $x \in B$
and the 1/2 cancel out). Thus, the Bhattacharyya parameter $Z(W)$ also bounds the error probability of
maximum likelihood decoding based on a single use of the channel $W$.
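
The quantities of this section can be checked numerically on small channels. The following Python sketch is illustrative only: it assumes a channel is represented as a dictionary mapping each output symbol $y$ to the pair $(W(y|0), W(y|1))$, and the helper names (`bsc`, `bec`, `bhattacharyya`, `channel_entropy`, `ml_error`) are chosen here rather than taken from the paper. It evaluates $Z(W)$, $H(W)$ and $P_e(W)$ and compares them against inequalities (1) and (2).

```python
import math


def bsc(p):
    """Binary symmetric channel: output y -> (W(y|0), W(y|1))."""
    return {0: (1 - p, p), 1: (p, 1 - p)}


def bec(e):
    """Binary erasure channel with erasure symbol '?'."""
    return {0: (1 - e, 0.0), 1: (0.0, 1 - e), "?": (e, e)}


def bhattacharyya(W):
    """Z(W) = sum over y of sqrt(W(y|0) W(y|1))."""
    return sum(math.sqrt(p0 * p1) for p0, p1 in W.values())


def channel_entropy(W):
    """H(W) = H(X|Y) for a uniform input bit X."""
    total = 0.0
    for p0, p1 in W.values():
        py = 0.5 * (p0 + p1)
        for p in (p0, p1):
            if p > 0:
                post = 0.5 * p / py  # posterior Pr(x | y)
                total -= py * post * math.log2(post)
    return total


def ml_error(W):
    """P_e(W) with ties counted as errors, as in the text."""
    return 0.5 * sum(p[x] for p in W.values() for x in (0, 1) if p[x] <= p[x ^ 1])


for name, W in (("BSC(0.11)", bsc(0.11)), ("BEC(0.5)", bec(0.5))):
    print(name, bhattacharyya(W), channel_entropy(W), ml_error(W))
```

For the erasure channel all three quantities coincide with the erasure probability, so the upper bounds in (1) and (2) hold with equality there.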

3 Polar codes 

3.1 Construction preliminaries 

This is a short primer on the motivations and techniques behind polar coding, following EllTl. Consider a 
family of invertible linear transformations $G_n : B^{2^n} \to B^{2^n}$ defined recursively as follows: $G_0 = [1]$ and
for a $2N$-bit vector $u = (u_0, u_1, \ldots, u_{2N-1})$ with $N = 2^n$, we define

$$G_{n+1} u = G_n(u_0 \oplus u_1, u_2 \oplus u_3, \ldots, u_{2N-2} \oplus u_{2N-1}) \circ G_n(u_1, u_3, u_5, \ldots, u_{2N-1}) \tag{3}$$

where $\circ$ is the vector concatenation operator. More explicitly, this construction can be shown to be equivalent
to the form $G_n = K^{\otimes n} B_n$ (see [2, Sec. VII]) where $B_n$ is the $2^n \times 2^n$ bit-reversal permutation
matrix for $n$-bit strings, $K = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$, and $\otimes$ denotes the Kronecker product.
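
The recursion (3) translates directly into code. The following Python sketch is an illustration, with function names chosen here: it implements the recursive transform and compares it, for small $n$, against the Kronecker-power form with a bit-reversal permutation, assuming the kernel with rows (1, 1) and (0, 1) acting on column vectors over GF(2).

```python
from itertools import product


def polar_transform(u):
    """Compute G_n u over GF(2) using the recursion (3)."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    xored = [u[2 * k] ^ u[2 * k + 1] for k in range(half)]
    odd = [u[2 * k + 1] for k in range(half)]
    return polar_transform(xored) + polar_transform(odd)


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of i (n >= 1)."""
    return int(format(i, f"0{n}b")[::-1], 2)


def explicit_transform(u):
    """Compute K^{(x)n} B_n u over GF(2), for len(u) = 2^n with n >= 1."""
    n = len(u).bit_length() - 1
    K = [[1, 1], [0, 1]]
    M = [[1]]
    for _ in range(n):
        M = kron(K, M)
    v = [u[bit_reverse(j, n)] for j in range(len(u))]  # v = B_n u
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]


# Exhaustively compare the two forms for block lengths 2, 4 and 8.
for n in (1, 2, 3):
    for u in product((0, 1), repeat=2 ** n):
        assert polar_transform(list(u)) == explicit_transform(list(u))
print(polar_transform([0, 0, 0, 1]))
```

The recursive form needs only $O(N \log N)$ bit operations, whereas the explicit matrix form above is quadratic and is used here purely as a cross-check.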



Suppose we use the matrix $G_n$ to encode a $N = 2^n$-size vector $U$, $X = G_n U$, and then transmit $X$ over
a binary-input symmetric channel $W$. It can be shown with a Martingale Convergence Theorem-based proof [2]
that for all $\epsilon > 0$,

$$\lim_{n \to \infty} \Pr_i\left[ H(U_i \mid U_0^{i-1}, Y_0^{N-1}) < \epsilon \right] = I(W). \tag{4}$$

where the notation $U_i^j$ denotes the subvector $(U_i, U_{i+1}, \ldots, U_j)$.

In words, there exists a good set of indices i so that for all elements in this set, given all of the outputs 
from the channel and (correct) decodings of all of the bits indexed less than i, the value of Ui can be 
ascertained with low probability of error (as it is a low-entropy random variable). 

For every element that is outside of the good set, we do not have this guarantee; this suggests an encoding
technique wherein we "freeze" all indices outside of this good set to a certain predefined value (0 will do).
We call the set of indices that are not in the good set the frozen set.

3.2 Successive cancellation decoder 

The above distinction between good indices and frozen indices suggests a successive cancellation decoding
technique where if the index is in the good set, we output the maximum-likelihood bit (which has low
probability of being wrong due to the low entropy) or if the index is in the frozen set, we output the predetermined
bit (which has zero probability of being incorrect). A sketch of such a successive cancellation decoder is
presented in Algorithm 1.

Definition 1. A polar code with frozen set $F \subseteq \{0, 1, \ldots, N - 1\}$ is defined as

$$C_F = \{ G_n u \mid u \in \{0,1\}^N, u_F = 0 \}.$$

By (4), if we take $F$ to be the positions with conditional entropy exceeding $\epsilon$, the rate of such a code
would approach $I(W)$ in the limit $N \to \infty$.

To simplify the probability calculation (as seen on line 6 of Algorithm 1 and explained further in the
comments), it is useful to consider the induced channel seen by each bit, $W_n^{(i)} : B \to \mathcal{Y}^N \times B^i$, for
$0 \le i \le 2^n - 1$. Here, we are trying to ascertain the most probable value of the input bit $U_i$ by considering
the output from all channels $Y_0^{N-1}$ and the (decoded) input from all channels before index $i$. Since the
probability of decoding error at every step is bounded above by the corresponding Bhattacharyya parameter
$Z$ by (2), we can examine $Z(W_n^{(i)})$ as a proxy for $P_e(W_n^{(i)})$.

It will be useful to redefine $W_n^{(i)}$ recursively both to bound the evolution of $Z(W_n^{(i)})$ and to facilitate the
computation. Consider the two transformations $-$ and $+$ defined as follows:

$$W^-(y_1, y_2 \mid x_1) = \sum_{x_2 \in B} \frac{1}{2} W(y_1 \mid x_1 \oplus x_2) W(y_2 \mid x_2) \tag{5}$$

and

$$W^+(y_1, y_2, x_1 \mid x_2) = \frac{1}{2} W(y_1 \mid x_1 \oplus x_2) W(y_2 \mid x_2). \tag{6}$$

The transformations $-$ and $+$ preserve information in the sense that

$$I(W^-) + I(W^+) = 2I(W), \tag{7}$$



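Algorithm 1: Successive cancellation decoder

input:  $y_0^{N-1}$, $F$, $W$
output: $u_0^{N-1}$

1   $u \leftarrow$ zero vector of size $N$
2   for $i \in 0..N-1$ do
3       if $i \in F$ then
4           $u_i \leftarrow 0$
5       else
6           if $\Pr(U_i = 0 \mid U_0^{i-1} = u_0^{i-1}, Y_0^{N-1} = y_0^{N-1}) / \Pr(U_i = 1 \mid U_0^{i-1} = u_0^{i-1}, Y_0^{N-1} = y_0^{N-1}) > 1$ then
7               $u_i \leftarrow 0$
8           else
9               $u_i \leftarrow 1$
10  return $u$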

Remark. The probability ratio on line 6 can be computed with a naive approach by recursively
computing (where $n = \lg N$) $W_n^{(i)}(y_0^{N-1}, u_0^{i-1} \mid x)$ for $x \in B$ according to the recursive evolution
equations (3), (5), (6). The test on line 6 is true if the expression is larger for $x = 0$ than it is for $x = 1$, as by
Bayes's theorem,

$$\Pr(U_i = 0 \mid U_0^{i-1} = u_0^{i-1}, Y_0^{N-1} = y_0^{N-1}) = \frac{W_n^{(i)}(y_0^{N-1}, u_0^{i-1} \mid 0) \Pr(U_i = 0)}{\Pr(U_0^{i-1} = u_0^{i-1}, Y_0^{N-1} = y_0^{N-1})}$$

and the term in the denominator is present in both the $U_i = 0$ and $U_i = 1$ expression and therefore
cancels in the division; the $\Pr(U_i = 0)$ term cancels as well for a uniform prior on $U_i$ (which is
necessary to achieve capacity for the symmetric channel $W$).

The runtime of the algorithm can be improved to $O(N \log N)$ by computing the probabilities on line
6 with a divide-and-conquer approach as in [2]. We note that this runtime bound assumes
constant-time arithmetic; consideration of $n$-bit arithmetic relaxes this bound to $O(N \mathrm{polylog}(N))$.
For a treatment of more aggressive quantizations, see [12, Chapter 6].
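
The naive recursive computation described in the remark can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it works with log-likelihood ratios (LLRs) $\log(W(y|0)/W(y|1))$ instead of raw probabilities, recomputes each bit's LLR from scratch (so it is slower than the $O(N \log N)$ divide-and-conquer variant), and all function names are chosen here.

```python
import math


def polar_encode(u):
    """Compute x = G_n u by the recursion (3)."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    a = [u[2 * k] ^ u[2 * k + 1] for k in range(half)]
    b = [u[2 * k + 1] for k in range(half)]
    return polar_encode(a) + polar_encode(b)


def xor_llr(a, b):
    """LLR of x1 when a is the LLR of (x1 xor x2) and b is the LLR of x2."""
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return (sign * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))


def bit_llr(channel_llrs, decided):
    """LLR of u_i, i = len(decided), given channel LLRs and earlier decisions."""
    n = len(channel_llrs)
    if n == 1:
        return channel_llrs[0]
    i = len(decided)
    j = i // 2
    a_prev = [decided[2 * k] ^ decided[2 * k + 1] for k in range(j)]
    b_prev = [decided[2 * k + 1] for k in range(j)]
    la = bit_llr(channel_llrs[: n // 2], a_prev)  # branch carrying u_{2j} xor u_{2j+1}
    lb = bit_llr(channel_llrs[n // 2:], b_prev)   # branch carrying u_{2j+1}
    if i % 2 == 0:
        return xor_llr(la, lb)                    # "minus" channel
    return lb + (la if decided[i - 1] == 0 else -la)  # "plus" channel


def sc_decode(channel_llrs, frozen):
    """Successive cancellation: frozen bits are 0, others follow the LLR sign."""
    u = []
    for i in range(len(channel_llrs)):
        if i in frozen:
            u.append(0)
        else:
            u.append(0 if bit_llr(channel_llrs, u) > 0 else 1)
    return u


def bsc_llrs(y, p):
    """Channel LLRs for received bits y over BSC_p."""
    scale = math.log((1 - p) / p)
    return [scale if bit == 0 else -scale for bit in y]


# Freeze all but the last position (a repetition code) and flip one bit.
y = polar_encode([0] * 7 + [1])
y[2] ^= 1
print(sc_decode(bsc_llrs(y, 0.1), frozen=set(range(7))))
```

In the usage example the only unfrozen position is the last one, which turns the code into a length-8 repetition code, so a single flipped bit is corrected by what amounts to a majority vote.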



which follows by the chain rule of mutual information, as (suppose $X_1$ is the input seen at $W^-$ and $X_2$ is
the input seen at $W^+$ and $Y_1, Y_2$ are the corresponding output variables)

$$I(W^-) + I(W^+) = I(X_1; Y_1, Y_2) + I(X_2; Y_1, Y_2 \mid X_1) = I(X_1, X_2; Y_1, Y_2) = 2I(W).$$

We also associate $-$ with a "downgrading" transformation and $+$ with an "upgrading" transformation, as

$$I(W^-) \le I(W) \le I(W^+).$$
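
The conservation of information and the downgrading/upgrading relation can be verified numerically on small channels. The Python sketch below is an illustration under assumptions made here: channels are dictionaries mapping an output symbol to the pair $(W(y|0), W(y|1))$, and the names `minus`, `plus` and `mutual_information` are hypothetical helpers implementing definitions (5), (6) and the uniform-input mutual information.

```python
import math


def bsc(p):
    return {0: (1 - p, p), 1: (p, 1 - p)}


def bec(e):
    return {0: (1 - e, 0.0), 1: (0.0, 1 - e), "?": (e, e)}


def minus(W):
    """W^-(y1, y2 | x1) = sum over x2 of 1/2 W(y1 | x1 xor x2) W(y2 | x2)."""
    out = {}
    for y1, p1 in W.items():
        for y2, p2 in W.items():
            out[(y1, y2)] = tuple(
                0.5 * sum(p1[x1 ^ x2] * p2[x2] for x2 in (0, 1)) for x1 in (0, 1))
    return out


def plus(W):
    """W^+(y1, y2, x1 | x2) = 1/2 W(y1 | x1 xor x2) W(y2 | x2)."""
    out = {}
    for y1, p1 in W.items():
        for y2, p2 in W.items():
            for x1 in (0, 1):
                out[(y1, y2, x1)] = tuple(
                    0.5 * p1[x1 ^ x2] * p2[x2] for x2 in (0, 1))
    return out


def mutual_information(W):
    """I(W) in bits for a uniform input bit."""
    total = 0.0
    for p0, p1 in W.values():
        py = 0.5 * (p0 + p1)
        for p in (p0, p1):
            if p > 0:
                total += 0.5 * p * math.log2(p / py)
    return total


W = bsc(0.11)
print(mutual_information(minus(W)), mutual_information(W), mutual_information(plus(W)))
```

The output alphabet squares (or more) with each application, which is why exact tracking of $I$ quickly becomes impractical and a simpler proxy is desirable.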

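Tying the operations $-$ and $+$ back to $Z(W_n^{(i)})$, we notice that $W^- = W_1^{(0)}$ (the transformation $-$ adds
uniformly distributed noise from another input $x_2$, which is equivalent to the induced channel seen by the
0th bit) and $W^+ = W_1^{(1)}$ (where here we clearly have the other input bit). More generally, by the recursive
construction (3), one can conclude that the $W_n^{(i)}$ process can be redefined in a recursive manner as

$$W_{n+1}^{(i)} = \begin{cases} (W_n^{(\lfloor i/2 \rfloor)})^- & \text{if $i$ is even} \\ (W_n^{(\lfloor i/2 \rfloor)})^+ & \text{if $i$ is odd} \end{cases} \tag{8}$$

with the base channel $W_0^{(0)} = W$.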

The evolution of $I(W^+)$ and $I(W^-)$ is difficult to analyze, but we will see in the next section that we
can adequately bound $Z(W^+)$ and $Z(W^-)$ as a proxy. Such bounds are sufficient for analyzing our decoder,
as we can bound the block error probability obtained by the successive cancellation decoder described in
Algorithm 1 with bounds on the Bhattacharyya parameters of the subchannels. The probability of the $i$th (not
frozen) bit being misdecoded by the algorithm, given the channel outputs and the input bits with index less
than $i$, is bounded above by $Z(W_n^{(i)})$ by equation (2). This observation, with the union bound, immediately
gives the following lemma.

Lemma 2. The block error probability of Algorithm 1 on a polar code $C$ of length $N$ with frozen set $F$ is
bounded above by the sum of the Bhattacharyya parameters $\sum_{i \notin F} Z(W_n^{(i)})$.
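
Lemma 2 suggests a simple way to pick a frozen set when the Bhattacharyya parameters are known. The Python sketch below is illustrative only and restricted to the binary erasure channel, where the parameters evolve exactly as $Z \mapsto 2Z - Z^2$ (even index) and $Z \mapsto Z^2$ (odd index); the function names and the rule of keeping the positions with the smallest parameters are assumptions made here.

```python
def bec_bhattacharyya(n, erasure_prob):
    """Exact Z(W_n^{(i)}) for i = 0..2^n - 1 when W is a BEC."""
    z = [erasure_prob]
    for _ in range(n):
        # Position 2j gets the "minus" value, position 2j + 1 the "plus" value.
        z = [v for zj in z for v in (2 * zj - zj * zj, zj * zj)]
    return z


def choose_frozen_set(z, num_info):
    """Keep the num_info positions with smallest Z; freeze the rest.

    Returns (frozen set, union bound on the block error probability).
    """
    order = sorted(range(len(z)), key=lambda i: z[i])
    info = set(order[:num_info])
    frozen = set(range(len(z))) - info
    return frozen, sum(z[i] for i in info)


z = bec_bhattacharyya(10, 0.5)
for num_info in (128, 256, 384):
    frozen, bound = choose_frozen_set(z, num_info)
    print(f"rate {num_info / len(z):.3f}  union bound {bound:.3e}")
```

Raising the rate toward the capacity (here 1/2) at a fixed block length makes the union bound grow, which is the finite-length trade-off between gap to capacity and block length that the paper quantifies.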

3.3 Bounds on Z{W~) and Z{W+) 

The general technique of the proof of these bounds is borrowed from [2, 16], and the results are rederived in
Appendix A for clarity and completeness.

Proposition 3. $Z(W^+) = Z(W)^2$ for all binary-input symmetric channels $W$.

Proposition 4. $Z(W)\sqrt{2 - Z(W)^2} \le Z(W^-) \le 2Z(W) - Z(W)^2$ for all binary-input symmetric channels
$W$, with equality holding in the upper bound on $Z(W^-)$ if the channel $W$ is an erasure channel.
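
Both propositions can be sanity-checked numerically by building $W^-$ and $W^+$ explicitly for small channels. The following Python sketch is only a check on examples, not a proof; it assumes a channel is a dictionary from output symbols to the pair $(W(y|0), W(y|1))$, and the helper names are chosen here.

```python
import math


def bsc(p):
    return {0: (1 - p, p), 1: (p, 1 - p)}


def bec(e):
    return {0: (1 - e, 0.0), 1: (0.0, 1 - e), "?": (e, e)}


def minus(W):
    """Channel W^- from definition (5)."""
    return {(y1, y2): tuple(0.5 * sum(p1[x1 ^ x2] * p2[x2] for x2 in (0, 1))
                            for x1 in (0, 1))
            for y1, p1 in W.items() for y2, p2 in W.items()}


def plus(W):
    """Channel W^+ from definition (6)."""
    return {(y1, y2, x1): tuple(0.5 * p1[x1 ^ x2] * p2[x2] for x2 in (0, 1))
            for y1, p1 in W.items() for y2, p2 in W.items() for x1 in (0, 1)}


def bhattacharyya(W):
    return sum(math.sqrt(p0 * p1) for p0, p1 in W.values())


def check_bounds(W, tol=1e-9):
    """Return True if Propositions 3 and 4 hold for W up to tolerance tol."""
    z = bhattacharyya(W)
    z_minus = bhattacharyya(minus(W))
    z_plus = bhattacharyya(plus(W))
    lower = z * math.sqrt(2 - z * z)
    upper = 2 * z - z * z
    return abs(z_plus - z * z) < tol and lower - tol <= z_minus <= upper + tol


for name, W in (("BSC(0.11)", bsc(0.11)), ("BEC(0.3)", bec(0.3))):
    print(name, check_bounds(W), bhattacharyya(minus(W)))
```

On these examples the binary symmetric channel sits at the lower bound of Proposition 4 and the erasure channel at the upper bound, so the tolerance in `check_bounds` is needed to absorb floating-point rounding.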

4 Speed of polarization 

Our first goal is to show that for some $m = O(\log(1/\epsilon))$, we have that $\Pr_i[Z(W_m^{(i)}) \le 2^{-O(m)}] \ge I(W) - \epsilon$
(the channel is "roughly" polarized). We will then use this rough polarization result to show that, for some
$n = O(\log(1/\epsilon))$, "fine" polarization occurs: $\Pr_i[Z(W_n^{(i)}) \le 2^{-2^{0.49 n}}] \ge I(W) - \epsilon$. This approach is similar
to the bootstrapping method used in [5].

4.1 Rough polarization 

First, let us state what is meant by rough polarization. We give a formal statement in the proposition below. A
similar statement can be constructed for binary erasure channels (as opposed to general symmetric channels)
with a much simpler proof; we include the statement and the simpler analysis in Appendix B.

Proposition 5. For all binary-input symmetric channels $W$ and $\rho \in (\Lambda^2, 1)$, there exists a constant $b_\rho$
(independent of $W$) such that for all $\epsilon > 0$ and $m \ge b_\rho \log(1/\epsilon)$, there exists a roughly polarized set

$$\mathcal{W}_r \subseteq \mathcal{W} = \{ W_m^{(i)} : 0 \le i \le 2^m - 1 \} \tag{9}$$

such that for all $M \in \mathcal{W}_r$, $Z(M) \le 2\rho^m$ and $\Pr_i(W_m^{(i)} \in \mathcal{W}_r) \ge I(W) - \epsilon$.

The idea here is that we will use what we call the symmetric Bhattacharyya parameter as a proxy for how
polarized a channel is (in a sense, how close to 0 or 1 it is):

$$Y(W) = Z(W)(1 - Z(W)),$$

and then develop a bound on $Y(W)$.

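To relate $Y(W_n^{(i)})$ back to $Z(W_n^{(i)})$, it is useful to define the sets (where $\rho \in (0, 1)$):

\[ A^g_\rho = \Bigl\{ i : Z(W_n^{(i)}) \le \tfrac{1 - \sqrt{1 - 4\rho^n}}{2} \Bigr\}, \quad A^b_\rho = \Bigl\{ i : Z(W_n^{(i)}) \ge \tfrac{1 + \sqrt{1 - 4\rho^n}}{2} \Bigr\}, \quad \text{and} \quad A_\rho = A^g_\rho \cup A^b_\rho. \tag{10} \]

We associate $A^g_\rho$ with the "good" set (the set of $i$ such that the Bhattacharyya parameter, and therefore the
probability of misdecoding, is small) and $A^b_\rho$ with the "bad" set. Note that $i \in A_\rho$ exactly when
$Y(W_n^{(i)}) \le \rho^n$; we write $\overline{A}_\rho$ for the complement of $A_\rho$. We record the following useful approximations,
both of which follow from $\sqrt{1 - 4\rho^n} \ge 1 - 4\rho^n$.

Fact 6. For $i \in A^g_\rho$, $Z(W_n^{(i)}) \le 2\rho^n$, and for $i \in A^b_\rho$, $Z(W_n^{(i)}) \ge 1 - 2\rho^n$.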

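By Propositions 3 and 4 we can write

\[ Z(W_{n+1}^{(i)}) = Z(W_n^{(\lfloor i/2 \rfloor)})^2 \quad \text{for $i$ odd,} \]

\[ Z(W_n^{(\lfloor i/2 \rfloor)}) \sqrt{2 - Z(W_n^{(\lfloor i/2 \rfloor)})^2} \le Z(W_{n+1}^{(i)}) \le 2Z(W_n^{(\lfloor i/2 \rfloor)}) - Z(W_n^{(\lfloor i/2 \rfloor)})^2 \quad \text{for $i$ even.} \]

This means we can also write the corresponding expression for $Y(W_{n+1}^{(i)})$. Writing $z = Z(W_n^{(\lfloor i/2 \rfloor)})$ for brevity,

\[ Y(W_{n+1}^{(i)}) = Z(W_{n+1}^{(i)})\bigl(1 - Z(W_{n+1}^{(i)})\bigr) \begin{cases} = z^2 (1 - z^2) & \text{for $i$ odd} \\ \le (2z - z^2)\bigl(1 - z\sqrt{2 - z^2}\bigr) & \text{for $i$ even} \end{cases} \]

where we have used both sides of the bound from Proposition 4 to form the second expression.
After rearrangement of terms, the expression above becomes

\[ \frac{Y(W_{n+1}^{(i)})}{Y(W_n^{(\lfloor i/2 \rfloor)})} \begin{cases} = z(1 + z) & \text{for $i$ odd} \\ \le \dfrac{(2 - z)\bigl(1 - z\sqrt{2 - z^2}\bigr)}{1 - z} & \text{for $i$ even.} \end{cases} \tag{11} \]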

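We first state a bound on the evolution of $\sqrt{Y(W_{n+1}^{(i)})}$ in terms of the parameters above.

Lemma 7. Let $f(z) = \frac{1}{2}\left( \sqrt{z(1 + z)} + \sqrt{\frac{(2 - z)(1 - z\sqrt{2 - z^2})}{1 - z}} \right)$ and $\Lambda = \max_{z \in [0,1]} f(z)$. Then

\[ \mathbb{E}_{i \bmod 2}\left[ \sqrt{Y(W_{n+1}^{(i)})} \right] \le \Lambda \sqrt{Y(W_n^{(\lfloor i/2 \rfloor)})}, \tag{12} \]

where the meaning of the expectation is that we fix $\lfloor i/2 \rfloor$ and allow $i \bmod 2$ to vary.

Proof. Expanding the expectation expression with (11), we obtain

\[ \mathbb{E}_{i \bmod 2}\left[ \sqrt{Y(W_{n+1}^{(i)})} \right] \le f\bigl(Z(W_n^{(\lfloor i/2 \rfloor)})\bigr) \sqrt{Y(W_n^{(\lfloor i/2 \rfloor)})}. \]

Since $Z(W) \in [0, 1]$ for all channels $W$, the bound with $\Lambda$ follows. $\square$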



We now bound $\Lambda$ away from 1, which implies a geometric decay of the expected value $\mathbb{E}_i[\sqrt{Y(W_n^{(i)})}]$
taken over a uniformly random $i$, $0 \le i \le 2^n - 1$.

Lemma 8. Let $\Lambda$ be defined as in Lemma 7. Then we have $\Lambda < 1$.

We relegate the proof of Lemma 8 to Appendix C but we note that $\Lambda < 19/20$, which can be verified
numerically by maximizing $f(z)$ as defined over the interval $[0, 1]$. While preparing this paper, we found
a more precise numerical bound in [13] that (12) holds with $\Lambda = 1.85/2$, which was obtained by using a
tighter expression for $f(z)$.
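The numerical claim $\Lambda < 19/20$ can be spot-checked with a simple grid search. The sketch below is only a floating-point check under assumed settings (the grid size and the helper names `f` and `estimate_lambda` are illustrative), not a substitute for the analytic proof in Appendix C. At $z = 1$, where the second square root has the form $0/0$, it uses the limiting value $f(1) = \sqrt{2}/2$.

```python
import math

def f(z):
    # f(z) from Lemma 7; the value at z = 1 is taken as the limit sqrt(2)/2.
    if z >= 1.0:
        return math.sqrt(2) / 2
    ratio = (2 - z) * (1 - z * math.sqrt(2 - z * z)) / (1 - z)
    # max(0, .) guards against tiny negative values caused by rounding.
    return 0.5 * (math.sqrt(z * (1 + z)) + math.sqrt(max(0.0, ratio)))

def estimate_lambda(points=100001):
    # Grid estimate of max_{z in [0,1]} f(z); this is a lower estimate of the true maximum.
    return max(f(k / (points - 1)) for k in range(points))

if __name__ == "__main__":
    print(estimate_lambda())
```

Because a grid search can only under-estimate a maximum, the result should be read as evidence for, not a proof of, the bound.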

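Corollary 9. Taking $\Lambda$ as defined in Lemma 7, $\Pr_i[Y(W_n^{(i)}) \ge \alpha^n] \le \frac{1}{2}\left(\frac{\Lambda^2}{\alpha}\right)^{n/2}$.

Proof. Clearly we have

\[ \mathbb{E}_i\left[ \sqrt{Y(W_n^{(i)})} \right] \le \Lambda^n \sqrt{Y(W)} \le \frac{\Lambda^n}{2} \]

and we can therefore use Markov's inequality to obtain the desired consequence. $\square$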

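We are now in a position where we can conclude Proposition 5.

Proof of Proposition 5. We apply the definitions (10) and Corollary 9 at level $n = m$. We have

\[ \Pr(A^g_\rho) \max_{i \in A^g_\rho} I(W_m^{(i)}) + \Pr(A^b_\rho) \max_{i \in A^b_\rho} I(W_m^{(i)}) + \Pr(\overline{A}_\rho) \max_{i \in \overline{A}_\rho} I(W_m^{(i)}) \ge \mathbb{E}_i[I(W_m^{(i)})] = I(W) \tag{13} \]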

where the last equality follows by the conservation of mutual information in our transformation as stated in 
equation (jT]). 

From [2, Proposition 1], $I(W)^2 \le 1 - Z(W)^2$ for any binary-input discrete memoryless channel $W$. As
$\min_{i \in A^b_\rho} Z(W_m^{(i)}) \ge 1 - 2\rho^m$ by Fact 6, we have $\max_{i \in A^b_\rho} I(W_m^{(i)}) \le 2\rho^{m/2}$. Using this together with
equation (13), we obtain

\[ \Pr(A^g_\rho) + \Pr(A^b_\rho) \cdot 2\rho^{m/2} + \Pr(\overline{A}_\rho) \ge I(W) \]

where we used the trivial inequality (apparent from the definition of $I$ of a binary-input channel) $I(W_m^{(i)}) \le 1$
for every $i$. Rearranging terms, using the bounds $\Pr(\overline{A}_\rho) \le \frac{1}{2}(\Lambda^2/\rho)^{m/2}$ from Corollary 9 and $Z(W_m^{(i)}) \le 2\rho^m$
for $i \in A^g_\rho$ from Fact 6, we get

\[ \Pr_i[Z(W_m^{(i)}) \le 2\rho^m] \ge \Pr(A^g_\rho) \ge I(W) - \frac{1}{2}\left(\frac{\Lambda^2}{\rho}\right)^{m/2} - 2\rho^{m/2}. \tag{14} \]

Clearly, if $\rho > \Lambda^2$, there is a constant $b_\rho$ such that $m \ge b_\rho \log(1/\epsilon)$ implies that the above lower bound is at
least $I(W) - \epsilon$. $\square$
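To make the last step concrete, the following Python sketch searches for the smallest $m$ at which the slack in (14), namely $\frac12(\Lambda^2/\rho)^{m/2} + 2\rho^{m/2}$, drops below $\epsilon$. The function names `rough_slack` and `min_levels`, the search cap `m_max`, and the numeric values used for $\Lambda$ and $\rho$ are assumptions made for illustration only; they are not constants taken from the paper.

```python
def rough_slack(m, lam, rho):
    # Slack term of (14): (1/2) (lam^2 / rho)^(m/2) + 2 rho^(m/2).
    return 0.5 * (lam * lam / rho) ** (m / 2) + 2 * rho ** (m / 2)

def min_levels(eps, lam, rho, m_max=10**6):
    # Smallest m >= 1 with rough_slack(m) <= eps; needs lam^2 < rho < 1
    # so that both terms decay geometrically in m.
    if not (lam * lam < rho < 1):
        raise ValueError("need lam^2 < rho < 1")
    for m in range(1, m_max + 1):
        if rough_slack(m, lam, rho) <= eps:
            return m
    raise RuntimeError("no suitable m found below m_max")

if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, min_levels(eps, lam=0.95, rho=0.95))
```

Since both terms decay geometrically, the returned $m$ grows linearly in $\log(1/\epsilon)$, which is the content of the constant $b_\rho$.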
4.2 Fine polarization

The following proposition formalizes what we mean by "fine polarization."

Proposition 10. Given $\epsilon \in (0, 1/2)$, a binary-input memoryless channel $W$, and a parameter $\delta \in (0, 1/2)$,
there exists a constant $c_\delta$ (independent of $W$ and $\epsilon$) such that if $n_0 > c_\delta \log(1/\epsilon)$ then

\[ \Pr_i\left[ Z(W_{n_0}^{(i)}) \le 2^{-2^{\delta n_0}} \right] \ge I(W) - \epsilon. \]



We will first need the following lemma to specify one of the constants.

Lemma 11. For all $\gamma > 0$, $\beta \in (0, 1/2)$ and $\rho \in (0, 1)$, there exists a constant $\theta(\beta, \gamma, \rho)$ such that for all
$\epsilon \in (0, 1)$, if $m > \theta(\beta, \gamma, \rho) \cdot \log(2/\epsilon)$, then



Proof. We can rewrite this expression as $c_1 \exp(-c_2 m) < \epsilon$ for constants $c_1, c_2$ that are independent of $\epsilon$
and the result is clear. $\square$

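Proof of Proposition 10. Fix a $\beta \in (\delta, 1/2)$, and let $\gamma = \frac{\delta}{\beta - \delta}$. Let $\rho$ be an appropriate constant for Proposition 5
and $b_\rho$ be the associated constant. We will define

\[ c_\delta = (1 + \gamma) \max\{ 2b_\rho,\ \theta(\beta, \gamma, \rho),\ 2/\rho,\ c_\beta \}, \]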

where $c_\beta$ is defined as a bound on $m$ such that if $m \ge c_\beta$, then

1-2-^^^ ' 

C/3 = 21J7& '"^^''^'• 

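Fix an $n_0 > c_\delta \log(1/\epsilon)$, $m = \frac{1}{1 + \gamma} n_0$ and $n = n_0 - m = \gamma m$. We first start with a set of roughly
polarized channels; by our choice of $c_\delta$, $m > b_\rho \log(2/\epsilon)$ and we can apply Proposition 5 and obtain a set
$\mathcal{W}_r$ where

\[ \Pr_i[W_m^{(i)} \in \mathcal{W}_r] \ge I(W) - \epsilon/2 \tag{15} \]

and $Z(M) \le 2\rho^m$ for all $M \in \mathcal{W}_r$. Let $R(m)$ be the set of all associated indices $i$ in $\mathcal{W}_r$.

Fix an $M \in \mathcal{W}_r$ and define a sequence $\{\tilde{Z}_j^{(i)}\}$ where

\[ \tilde{Z}_{j+1}^{(i)} = \begin{cases} \bigl(\tilde{Z}_j^{(\lfloor i/2 \rfloor)}\bigr)^2 & i \bmod 2 = 1 \\ 2\tilde{Z}_j^{(\lfloor i/2 \rfloor)} & i \bmod 2 = 0 \end{cases} \tag{16} \]

with the base case $\tilde{Z}_0^{(0)} = Z(M)$. Clearly $Z(M_n^{(i)}) \le \tilde{Z}_n^{(i)}$. (Recall that $X_n^{(i)}$ is the polarization process
done for $n$ steps with $i$ determining which branch to take for arbitrary binary-input channel $X$.)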
Let Cp = \2mrwri\ ■ ^^^ a /? G (0, 1/2). Define a collection of events {Gj{n) : 1 ^ j ^ Cp}: 

\[ G_j(n) = \Bigl\{ i : \sum_{k \in [jn/c_\rho,\ (j+1)n/c_\rho)} i_k \ge \beta n/c_\rho \Bigr\}. \tag{17} \]

Here, $i_k$ indicates the $k$'th least significant bit in the binary representation of $i$. Qualitatively speaking, $G_j$
occurs when the number of 1's in the $j$'th block is not too small.

Since each bit of $i$ is independently distributed, we can apply the Chernoff-Hoeffding bound [14] (with
$p = 1/2$ and $\epsilon = 1/2 - \beta$) to conclude

\[ \Pr_i(i \in G_j(n)) \ge 1 - \exp\bigl(-2(1/2 - \beta)^2 n/c_\rho\bigr) = 1 - \exp\bigl(-(1 - 2\beta)^2 n/(2c_\rho)\bigr) \tag{18} \]

for all $j \in [c_\rho]$.
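The per-block bound (18) can be compared against the exact binomial tail. The Python sketch below does this for a single block of $L = n/c_\rho$ fair bits; the names `exact_bad_block_prob` and `hoeffding_bound` and the sample values of $L$ and $\beta$ are illustrative assumptions.

```python
import math

def exact_bad_block_prob(L, beta):
    # Exact Pr[fewer than beta*L ones among L independent fair bits],
    # i.e. the probability that the event G_j fails for one block.
    return sum(math.comb(L, k) for k in range(L + 1) if k < beta * L) / 2 ** L

def hoeffding_bound(L, beta):
    # Chernoff-Hoeffding upper bound exp(-2 (1/2 - beta)^2 L) on the same event.
    return math.exp(-2 * (0.5 - beta) ** 2 * L)

if __name__ == "__main__":
    for L in (10, 40, 200):
        print(L, exact_bad_block_prob(L, 0.3), hoeffding_bound(L, 0.3))
```

The exact tail is typically much smaller than the bound, but the bound has the closed form needed to sum over the $c_\rho$ blocks.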
Define

\[ G(n) = \bigcap_j G_j(n). \tag{19} \]

Applying the union bound to $G(n)$ with (18), we obtain

\[ \Pr_i(i \in G(n)) \ge 1 - c_\rho \exp\bigl(-(1 - 2\beta)^2 n/(2c_\rho)\bigr). \tag{20} \]

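Now we develop an upper bound on the evolution of $\tilde{Z}$ for each interval of $n/c_\rho$ squaring/doubling
operations, conditioned on $i$ belonging to the high probability set $G(n)$. Below, $\tilde{Z}_t$ denotes the value of the
sequence after the first $t$ steps along the branch determined by $i$.

Fix an interval $j \in \{0, 1, \ldots, c_\rho\}$. By the evolution equations (16) and the bound provided by (17), it is
easy to see that the greatest possible value for $\tilde{Z}_{(j+1)n/c_\rho}$ is attained by $(1 - \beta)n/c_\rho$ doublings followed by
$\beta n/c_\rho$ squarings. Therefore,

\[ \lg \tilde{Z}_{(j+1)n/c_\rho} \le 2^{\beta n/c_\rho}\Bigl( \lg \tilde{Z}_{jn/c_\rho} + (1 - \beta)\frac{n}{c_\rho} \Bigr). \]

Cascading this argument over all intervals $j$, we obtain

\begin{align*}
\lg Z(M_n^{(i)}) &\le \lg \tilde{Z}_n^{(i)} \\
&\le 2^{n\beta} \lg Z(M) + \frac{n}{c_\rho}(1 - \beta)\bigl(2^{n\beta/c_\rho} + 2^{2n\beta/c_\rho} + \cdots + 2^{n\beta}\bigr) \tag{21} \\
&\le 2^{n\beta} \lg Z(M) + \frac{n}{c_\rho}(1 - \beta)\frac{2^{n\beta}}{1 - 2^{-n\beta/c_\rho}} \\
&= 2^{n\beta}\Bigl( \lg Z(M) + \frac{n}{c_\rho} \cdot \frac{1 - \beta}{1 - 2^{-n\beta/c_\rho}} \Bigr) \\
&\le 2^{n\beta}\bigl( \lg Z(M) + n/c_\rho \bigr) \quad \text{as } m \ge c_\beta.
\end{align*}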
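The per-interval extremal ordering used above (all doublings first, then all squarings) can be confirmed by brute force on small instances. In the $\lg$ domain a doubling adds 1 and a squaring multiplies by 2. The sketch below enumerates every ordering of `d` doublings and `s` squarings; the function names and the toy sizes are illustrative assumptions.

```python
from itertools import combinations

def final_lg(x0, ops):
    # Apply a sequence of operations to x = lg Z:
    # "d" (doubling Z) adds 1, "s" (squaring Z) multiplies by 2.
    x = x0
    for op in ops:
        x = x + 1 if op == "d" else 2 * x
    return x

def worst_case(x0, d, s):
    # Enumerate all orderings of d doublings and s squarings and return
    # (largest final lg-value, ordering that attains it).
    best = None
    for sq_pos in combinations(range(d + s), s):
        ops = ["s" if t in sq_pos else "d" for t in range(d + s)]
        val = final_lg(x0, ops)
        if best is None or val > best[0]:
            best = (val, "".join(ops))
    return best

if __name__ == "__main__":
    print(worst_case(-10, 3, 2))
```

The maximum equals $2^{s}(x_0 + d)$, matching the single-interval bound with $d = (1 - \beta)n/c_\rho$ and $s = \beta n/c_\rho$.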

As M G Wr, Z{M) ^ 2p^, and n/cp ^ 2m\g{2/p), we can bound above as 



^ -2"^ lg(2/p)m (22) 

^ -2"^ as m ^ 2/p 



This shows that

\[ Z(W_{n_0}^{(i)}) \le 2^{-2^{\beta n}} = 2^{-2^{\delta n_0}}, \]

where the equality is due to the definition of $n$ and $m$, as long as the first $m$ bits of $i$ are in $R(m)$ and the last
$n$ bits of $i$ are in $G(n)$. The former has probability at least $I(W) - \epsilon/2$ by (15) and the latter has probability
at least

1 _ ,,exp(-(l - 2efn,(2c,)) , 1 - (Mf>2 + l) exp (_ d - ^^);Wp)>» ) ^ , _ ./^ 

by our choice of $c_\delta$ and Lemma 11.

Putting the two together with the union bound, we obtain

\[ \Pr_i\left[ Z(W_{n_0}^{(i)}) \le 2^{-2^{\delta n_0}} \right] \ge I(W) - \epsilon. \qquad \square \]



The following corollary will be useful in the next section, where we will deal with an approximation to
the Bhattacharyya parameter. It relaxes the conditions on the polarized set from Proposition 5.

Corollary 12. Proposition 10 still holds with a modified roughly polarized set $\widetilde{\mathcal{W}}_r$ (recall the definition of the
roughly polarized set $\mathcal{W}_r$ from equation (9)), where $\widetilde{\mathcal{W}}_r \supseteq \mathcal{W}_r$ and $Z(M) \le \sqrt{3\rho^m}$ (instead of $2\rho^m$) for all $M \in \widetilde{\mathcal{W}}_r$,
with a modified constant $\tilde{c}_\delta$.

Proof. The changes that need to be made follow from Equation (22), where $\lg Z(M)$ is used. With the
extra square root, an extra factor of 1/2 appears outside of the $\lg$, which means $c_\rho$ needs to be adjusted
by a constant factor. In addition, $\lg(2/\rho)$ needs to be adjusted to $\lg(3/\rho)$, but this is also just a constant
change. $\square$

5 Efficient construction of polar codes 

The construction of a polar code reduces to determining the frozen set of indices (the generator matrix then
consists of columns of $G_n = K^{\otimes n} B_n$ indexed by the non-frozen positions). The core component of the
efficient construction of a frozen set is estimating the Bhattacharyya parameters of the subchannels $W_n^{(i)}$.
In the erasure case, this is simple because the evolution equation offered by Proposition 4 is exact. In the
general case, the naive calculation takes too much time: $W_n^{(i)}$ has an exponentially large output alphabet
size in terms of $N = 2^n$.

Our goal, therefore, is to limit the alphabet size of $W_n^{(i)}$ while roughly maintaining the same Bhattacharyya
parameter. With this sort of approach, we can select channels with relatively good Bhattacharyya
parameters. The idea of approximating the channel behavior by degrading it via output symbol merging is
due to [22] and variants of it were analyzed in [10]. The approach is also discussed in the survey [7, Section
3.3]. Since we can only achieve an inverse polynomial error in estimating the Bhattacharyya parameters
with a polynomial alphabet, we use the estimation only up to the rough polarization step, and then use the
explicit description of the subsequent good channels that is implicit in the proof of Proposition 10.

We note that revised versions of the Tal-Vardy work [22] also include a polynomial time algorithm for
code construction by combining their methods with the analysis of [19]. However, as finite-length bounds
on the speed of polarization were not available to them, they could not claim poly$(N/\epsilon)$ construction time,
but only $c_\epsilon N$ time for some unspecified $c_\epsilon$.

We will first state our binning algorithm, along with its properties, and then conclude the main theorem.

5.1 Binning Algorithm

For our binning, we deal with the marginal distributions of the input bit given an output symbol. A binary-input
symmetric channel $W$ defines a marginal probability distribution $W(y|x)$. We invert this conditioning
to form the expression

\[ p(0|y) = \Pr_x(x = 0 \mid W(x) = y) = \frac{W(y|0)}{2\Pr_x(W(x) = y)} \]

for a uniformly distributed input bit $x$. In addition, we introduce the one-argument form

\[ p(y) = \Pr_x(W(x) = y) \]

for the simple probability that the output is $y$ given a uniformly distributed input bit $x$.

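Algorithm 2: Binning algorithm
input: $W : \mathcal{B} \to \mathcal{Y}$, $k > 0$
output: $\widetilde{W} : \mathcal{B} \to \widetilde{\mathcal{Y}}$

1 Initialize new channel $\widetilde{W}$ with symbols $\tilde{y}_0, \tilde{y}_1, \ldots, \tilde{y}_k$ with $\widetilde{W}(\tilde{y}|x) = 0$ for all $\tilde{y}$ and $x \in \mathcal{B}$
2 for $y \in \mathcal{Y}$ do
3     $p(0|y) \leftarrow \frac{W(y|0)}{2\Pr_x(W(x) = y)}$
4     $\widetilde{W}(\tilde{y}_{\lfloor k p(0|y) \rfloor} \mid 0) \leftarrow \widetilde{W}(\tilde{y}_{\lfloor k p(0|y) \rfloor} \mid 0) + W(y|0)$
5     $\widetilde{W}(\tilde{y}_{\lfloor k p(0|y) \rfloor} \mid 1) \leftarrow \widetilde{W}(\tilde{y}_{\lfloor k p(0|y) \rfloor} \mid 1) + W(y|1)$
6 return $\widetilde{W}$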
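A direct transcription of Algorithm 2 into Python might look as follows. This is a sketch under stated assumptions: a channel is represented as a dictionary from output symbol to the pair $(W(y|0), W(y|1))$, merged symbols are identified by their bin index $\lfloor k\,p(0|y) \rfloor$, and the names `bin_channel`, `cond_entropy` and `h2` are illustrative. The entropy helper computes $H(W) = \sum_y p(y)\,h(p(0|y))$ so that the effect of merging can be inspected.

```python
import math
from collections import defaultdict

def h2(x):
    # Binary entropy function h(x) in bits, with h(0) = h(1) = 0.
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def cond_entropy(W):
    # H(W) = sum_y p(y) h(p(0|y)) for a uniform input bit,
    # where p(y) = (W(y|0) + W(y|1)) / 2 and p(0|y) = W(y|0) / (2 p(y)).
    total = 0.0
    for w0, w1 in W.values():
        s = w0 + w1
        if s > 0:
            total += (s / 2) * h2(w0 / s)
    return total

def bin_channel(W, k):
    # Merge all output symbols y whose posterior p(0|y) has the same
    # value of floor(k * p(0|y)); at most k + 1 merged symbols result.
    merged = defaultdict(lambda: [0.0, 0.0])
    for w0, w1 in W.values():
        s = w0 + w1
        if s == 0:
            continue  # symbol never occurs; nothing to merge
        j = math.floor(k * (w0 / s))
        merged[j][0] += w0
        merged[j][1] += w1
    return {j: (v[0], v[1]) for j, v in merged.items()}
```

Only bins that actually receive probability mass are stored, so the returned dictionary can have fewer than $k + 1$ entries.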



Proposition 13. For a binary-input symmetric channel $W : \mathcal{B} \to \mathcal{Y}$ and all $k > 0$, there exists a channel
$\widetilde{W} : \mathcal{B} \to \widetilde{\mathcal{Y}}$ such that

\[ H(W) \le H(\widetilde{W}) \le H(W) + 2\lg(k)/k, \qquad |\widetilde{\mathcal{Y}}| \le k + 1, \]

and the marginal probability distribution $\widetilde{W}(\tilde{y}|x)$ is computable, by Algorithm 2, in time polynomial in $|\mathcal{Y}|$
and $k$.

We will delay the proof of Proposition 13 to Appendix D as the details are somewhat mechanical. We
note that a slightly different binning strategy [22] can achieve an approximation error of $O(1/k)$. We chose
to employ a simple variant that still works for our purposes.

We will iteratively use the binning algorithm underlying Proposition 13 to select the best channels. The
following corollary formalizes this.

Corollary 14. Let $\widetilde{W}_n^{(i)}$ indicate the result of using Algorithm 2 after every application of the evolution
Equations ([D; that is, 

\[ \widetilde{W}_{n+1}^{(i)} = \widetilde{\bigl(\widetilde{W}_n^{(\lfloor i/2 \rfloor)}\bigr)^{\pm}} \]

where the $+$ or $-$ is chosen depending on the corresponding bit, starting from the least significant one, of
the binary representation of $i \in \{0, 1, \ldots, 2^n - 1\}$. Then

\[ H(W_n^{(i)}) \le H(\widetilde{W}_n^{(i)}) \le H(W_n^{(i)}) + \frac{2^{n+2}\lg(k)}{k}. \]

Proof. The lower bound is obvious as the operation $\widetilde{\,\cdot\,}$ never decreases the entropy of the channel, as
mentioned in the proof of Proposition 13.

For the upper bound, we'd like to consider the error expression summed over all $\widetilde{W}_n^{(i)}$:

\[ \sum_{i=0}^{2^n - 1} H(\widetilde{W}_n^{(i)}) - \sum_{i=0}^{2^n - 1} H(W_n^{(i)}) = \sum_{i=0}^{2^n - 1} H(\widetilde{W}_n^{(i)}) - 2^n H(W) \tag{23} \]

as Eb£{+ _} H{W'') = H{W) by dV]). At every approximation stage, we have, from Proposition [T3l 

\[ H(\widetilde{W}) \le H(W) + \frac{2\lg(k)}{k}. \]

Applying this to every level of the expression (23) (colloquially speaking, we strip off the tildes $n$ times),
we obtain

\[ \sum_{i=0}^{2^n - 1} H(\widetilde{W}_n^{(i)}) - 2^n H(W) \le \frac{2\lg k}{k}\bigl(2 + 2^2 + \cdots + 2^n\bigr) \le \frac{2^{n+2}\lg k}{k}. \]

Since the sum of all of the errors $H(\widetilde{W}_n^{(i)}) - H(W_n^{(i)})$ is upper bounded by $\frac{2^{n+2}\lg k}{k}$, each error is also
upper bounded by $\frac{2^{n+2}\lg k}{k}$ (since no error is negative due to the lower bound). $\square$

We are now in a position to restate and prove our main theorem (Theorem 1).

Theorem. There is an absolute constant $\mu < \infty$ such that the following holds. Let $W$ be a binary-input
output-symmetric memoryless channel with capacity $I(W)$. Then there exists $a_W < \infty$ such that for all
$\epsilon > 0$ and all powers of two $N \ge a_W (1/\epsilon)^\mu$, there is a deterministic poly$(N)$ time construction of a binary
linear code of block length $N$ and rate at least $I(W) - \epsilon$ and a deterministic $O(N \log N)$ time decoding
algorithm with block error probability at most $2^{-N^{0.49}}$.

Proof. Fix an $N$ that is a power of 2, and let $n_0 = \lg(N)$. Define $m$, $n$, $\rho$ as they are in Proposition 10.
Utilizing the definition of ^ from Corollary [T4]with ^ = ( f ) > let Wr be the set of all channels Wm such 

that $H(\widetilde{W}_m^{(i)}) \le 3\rho^m$, and let $\widetilde{R}(m)$ be the set of corresponding indices $i$. Define the complement of the
frozen set

\[ \overline{F}_{n_0} = \bigl\{ i \mid 0 \le i \le 2^{n_0} - 1,\ i_0^{m-1} \in \widetilde{R}(m),\ i_m^{n_0 - 1} \in G(n_0 - m) \bigr\} \]

where $G(n)$ is defined in Equation (19) and the notation $i_j^k = \lfloor i/2^j \rfloor \bmod 2^{k-j+1}$ means the integer with the
binary representation of the $j$th through $k$th bits of $i$, inclusive. We note that this set $\overline{F}_{n_0}$ is computable
in poly$(1/\epsilon, N)$ time: $\widetilde{R}(m)$ is computable in poly$(1/\epsilon)$ time because $k \le$ poly$(1/\epsilon)$ and $G(n_0 - m)$ is
computable in $O(N)$ time as it is just counting the number of 1 bits in various intervals.
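The bit-slicing notation and the membership test for $G(\cdot)$ translate directly into integer operations. The Python sketch below is illustrative only: `rough_indices` is a hypothetical stand-in for the set $\widetilde{R}(m)$ (which in the construction comes from the binned channel estimates), blocks are indexed from 0, and the number of bits is assumed to be divisible by the number of blocks `c`.

```python
def bit_slice(i, j, k):
    # Integer formed by bits j through k of i (inclusive, bit 0 least significant):
    # floor(i / 2^j) mod 2^(k - j + 1).
    return (i >> j) % (1 << (k - j + 1))

def in_G(i, n, c, beta):
    # True if each of the c consecutive blocks of n // c bits of i
    # contains at least beta * (n // c) ones (assumes c divides n).
    block = n // c
    for j in range(c):
        ones = sum((i >> t) & 1 for t in range(j * block, (j + 1) * block))
        if ones < beta * block:
            return False
    return True

def in_unfrozen(i, n0, m, rough_indices, c, beta):
    # Membership in the complement of the frozen set: the low m bits must
    # index a roughly polarized channel and the remaining n0 - m bits must lie in G.
    low = bit_slice(i, 0, m - 1)
    high = bit_slice(i, m, n0 - 1)
    return low in rough_indices and in_G(high, n0 - m, c, beta)
```

Each membership test costs time linear in the number of bits $n_0 = \lg N$, so scanning all $N$ indices takes $O(N \log N)$ bit operations in this naive form.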

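By Corollary 14 we can conclude that $i \in R(m)$ implies $i \in \widetilde{R}(m)$ because $Z(W_m^{(i)}) \le 2\rho^m$ implies
$H(W_m^{(i)}) \le Z(W_m^{(i)}) \le 2\rho^m$. This in turn implies $H(\widetilde{W}_m^{(i)}) \le 3\rho^m$ by our choice of $k$ and the
approximation error guaranteed by Corollary 14. Therefore, we have

\[ \Pr_{i < 2^m}\bigl(i \in \widetilde{R}(m)\bigr) \ge \Pr_{i < 2^m}\bigl(i \in R(m)\bigr) \]

and also that all $M \in \widetilde{\mathcal{W}}_r$ satisfy $Z(M) \le \sqrt{H(M)} \le \sqrt{3\rho^m}$, where the former inequality is from ([T}.

Applying Corollary 12 with our modified set $\widetilde{\mathcal{W}}_r$, we can now conclude $\Pr_i(i \in \overline{F}_{n_0}) \ge I(W) - \epsilon$ and
$Z(W_{n_0}^{(i)}) \le 2^{-2^{\delta n_0}}$ for all $i$ in $\overline{F}_{n_0}$. Since $2^{\delta n_0} = N^\delta$ and $|\overline{F}_{n_0}| \le N$, this implies that

\[ \sum_{i \in \overline{F}_{n_0}} Z(W_{n_0}^{(i)}) \le N \cdot 2^{-N^\delta}. \]

Taking $\delta = 0.499$ and $\mu = \tilde{c}_\delta$, we can conclude the existence of an $a_W$ such that for $N \ge a_W (1/\epsilon)^\mu$,

\[ \sum_{i \in \overline{F}_{n_0}} Z(W_{n_0}^{(i)}) \le 2^{-N^{0.49}} \]

as such $\mu$ satisfies the conditions of Corollary 12. The proof is now complete since by Lemma 2, the block
error probability of polar codes with a frozen set $F$ under successive cancellation decoding is bounded by
the sum of the Bhattacharyya parameters of the channels not in $F$. $\square$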

6 Future work 

The explicit value of $\mu$ found in Theorem 1 is a large constant and far from the empirically suggested bound
of approximately 4. Tighter versions of this analysis should be able to minimize the difference between the
upper bound suggested by Theorem 1 and the available lower bounds.

We hope to extend these results shortly to channels with non-binary input alphabets, utilizing a decomposition
of channels to prime input alphabet sizes [18]. Another direction is to study the effect of recursively
using larger $\ell \times \ell$ kernels instead of the $2 \times 2$ matrix $K = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$. Of course in the limit of $\ell \to \infty$, by
appealing to the behavior of random linear codes we will achieve $\mu \to 2$, but the decoding complexity will
grow as $2^\ell$. The trade-off between $\mu$ and $\ell$ for fixed $\ell > 2$ might be interesting to study.

A Proofs of Z-parameter evolution equations

The Z-parameter evolution equations are a special case of the lemmas in [16], specifically in the appendices
to Chapters 2 and 3, and the proof techniques used here are based on the proofs of those lemmas.

Proof of Proposition 3. This can be done directly by definition. Let $\mathcal{Y}$ be the output alphabet of $W_n$. Then

\begin{align*}
Z(W_n^+) &= \sum_{y_1, y_2 \in \mathcal{Y},\ x \in \mathcal{B}} \sqrt{W_n^+(y_1, y_2, x \mid 0)\, W_n^+(y_1, y_2, x \mid 1)} \\
&= \frac{1}{2} \sum_{y_1, y_2 \in \mathcal{Y},\ x \in \mathcal{B}} \sqrt{W_n(y_1 | x \oplus 0) W_n(y_2 | 0) W_n(y_1 | x \oplus 1) W_n(y_2 | 1)} \\
&= \frac{1}{2} \sum_{x \in \mathcal{B},\ y_1 \in \mathcal{Y}} \sqrt{W_n(y_1 | x) W_n(y_1 | x \oplus 1)} \sum_{y_2 \in \mathcal{Y}} \sqrt{W_n(y_2 | 0) W_n(y_2 | 1)} \\
&= \sum_{y_1 \in \mathcal{Y}} \sqrt{W_n(y_1 | 0) W_n(y_1 | 1)} \sum_{y_2 \in \mathcal{Y}} \sqrt{W_n(y_2 | 0) W_n(y_2 | 1)} = Z(W_n)^2,
\end{align*}

where the first step is the expansion of the definition of $W_n^+$ and the rest is arithmetic. $\square$

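Proof of Proposition 4. We first show $Z(W_n^-) \le 2Z(W_n) - Z(W_n)^2$. Again, let $\mathcal{Y}$ be the output alphabet
of $W_n$. Then we have

\begin{align*}
Z(W_n^-) &= \sum_{y_1, y_2 \in \mathcal{Y}} \sqrt{W_n^-(y_1, y_2 | 0)\, W_n^-(y_1, y_2 | 1)} \tag{24} \\
&= \frac{1}{2} \sum_{y_1, y_2 \in \mathcal{Y}} \sqrt{ \sum_{x_1 \in \mathcal{B}} W_n(y_1 | x_1) W_n(y_2 | x_1) \sum_{x_2 \in \mathcal{B}} W_n(y_1 | x_2 \oplus 1) W_n(y_2 | x_2) } \\
&= \frac{1}{2} \sum_{y_1, y_2 \in \mathcal{Y}} \sqrt{W_n(y_1|0) W_n(y_1|1)} \sqrt{W_n(y_2|0) W_n(y_2|1)} \sqrt{ \frac{W_n(y_1|0)}{W_n(y_1|1)} + \frac{W_n(y_2|0)}{W_n(y_2|1)} + \frac{W_n(y_1|1)}{W_n(y_1|0)} + \frac{W_n(y_2|1)}{W_n(y_2|0)} }
\end{align*}

and we note that we can define a probability mass function $p(y) = \frac{\sqrt{W_n(y|0) W_n(y|1)}}{Z(W_n)}$, so we write

\[ Z(W_n^-) = \frac{Z(W_n)^2}{2} \sum_{y_1, y_2 \in \mathcal{Y}} p(y_1) p(y_2) \sqrt{ \frac{W_n(y_1|0)^2 + W_n(y_1|1)^2}{W_n(y_1|0) W_n(y_1|1)} + \frac{W_n(y_2|0)^2 + W_n(y_2|1)^2}{W_n(y_2|0) W_n(y_2|1)} } \]

and introducing $f(y) = \sqrt{W_n(y|0)/W_n(y|1)} + \sqrt{W_n(y|1)/W_n(y|0)}$, we can write

\begin{align*}
Z(W_n^-) &= \frac{Z(W_n)^2}{2}\, \mathbb{E}_{y_1, y_2 \sim p(y)} \sqrt{f(y_1)^2 + f(y_2)^2 - 4} \tag{26} \\
&\le \frac{Z(W_n)^2}{2} \bigl( \mathbb{E}(f(y_1)) + \mathbb{E}(f(y_2)) - 2 \bigr) \quad \text{using } \sqrt{a + b - c} \le \sqrt{a} + \sqrt{b} - \sqrt{c} \text{ when } a, b \ge c,
\end{align*}

and since $\mathbb{E}_{y \sim p(y)}[f(y)] = 2/Z(W_n)$,

\[ = 2Z(W_n) - Z(W_n)^2. \]

For the lower bound, we can apply Jensen's inequality twice to the function $\sqrt{x^2 + a}$, which is convex for
$a \ge 0$, together with $f(y_i)^2 \ge 4$, to obtain

\begin{align*}
Z(W_n^-) &= \frac{Z(W_n)^2}{2}\, \mathbb{E}_{y_1, y_2 \sim p(y)} \sqrt{f(y_1)^2 + f(y_2)^2 - 4} \\
&\ge \frac{Z(W_n)^2}{2} \sqrt{ \frac{4}{Z(W_n)^2} + \frac{4}{Z(W_n)^2} - 4 } \\
&= Z(W_n) \sqrt{2 - Z(W_n)^2}.
\end{align*}

We note that $p(y) = 0$ for all $y$ where either $W_n(y|0)$ or $W_n(y|1)$ is zero, so the expressions involving $f(y)$
are well-defined even if $f(y)$ is not defined for all $y$.

In the case that $W$ is a binary erasure channel, the expression (26) can be simplified to obtain a tight
bound. If $y$ is an erasure symbol, then $f(y) = 2$, and otherwise, $p(y) = 0$. This means that we simply have

\[ \mathbb{E}_{y_1, y_2 \sim p(y)} \sqrt{f(y_1)^2 + f(y_2)^2 - 4} = \mathbb{E}_{y \sim p(y)} f(y) \]

and the equality follows. $\square$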

B Rough polarization for erasure channels 

If $W$ is the binary erasure channel, we have $I(W_n) = 1 - Z(W_n)$ and $Z(W_n^-) = 2Z(W_n) - Z(W_n)^2$. In
this case, we can show the following.

Proposition 15. For the binary erasure channel $W$, for all $\alpha \in (3/4, 1)$, there exists a constant $c_\alpha$ such that
for all $\epsilon > 0$ and $m \ge c_\alpha \log(1/\epsilon)$ we have

\[ \Pr_i\bigl( Z(W_m^{(i)}) \le 2\alpha^m \bigr) \ge I(W) - \epsilon. \]



Proof. We can rearrange the evolution Equation dD and apply Propositions [3] and |4] for the BEC case to 
obtain the equation 
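\[ \mathbb{E}_{i \bmod 2}\left[ \sqrt{Y(W_{n+1}^{(i)})} \right] = \frac{1}{2} \sqrt{Y(W_n^{(\lfloor i/2 \rfloor)})} \Bigl( \sqrt{z(1 + z)} + \sqrt{(1 - z)(2 - z)} \Bigr), \quad z = Z(W_n^{(\lfloor i/2 \rfloor)}). \tag{27} \]

Since $\sqrt{z(1 + z)} + \sqrt{(1 - z)(2 - z)} \le \sqrt{3}$ for all $z \in [0, 1]$ (observed originally by g]), and $Y(W) \le 1/4$,
we can conclude the geometrically decaying upper bound $\mathbb{E}_i\left[\sqrt{Y(W_n^{(i)})}\right] \le \frac{1}{2}\left(\frac{\sqrt{3}}{2}\right)^n$. Therefore, by
Markov's inequality, we have

\[ \Pr_i[Y(W_n^{(i)}) \ge \alpha^n] \le \frac{1}{2}\left(\frac{3}{4\alpha}\right)^{n/2}. \tag{28} \]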

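We have $\mathbb{E}_i[Z(W_n^{(i)})] = \mathbb{E}_i[Z(W_{n-1}^{(i)})] = Z(W)$, and so

\[ \Pr(A^b_\alpha) \min_{i \in A^b_\alpha} Z(W_n^{(i)}) \le \mathbb{E}_i[Z(W_n^{(i)})] = Z(W). \]

Since $A^g_\alpha \subseteq A_\alpha$ and $A^g_\alpha$ is disjoint from $A^b_\alpha$, we have $\Pr(A^g_\alpha) = 1 - \Pr(A^b_\alpha) - \Pr(\overline{A}_\alpha)$, and we obtain

\[ \Pr(A^g_\alpha) \ge 1 - \frac{Z(W)}{\min_{i \in A^b_\alpha} Z(W_n^{(i)})} - \Pr(\overline{A}_\alpha) \ge 1 - \frac{Z(W)}{1 - 2\alpha^n} - \frac{1}{2}\left(\frac{3}{4\alpha}\right)^{n/2} \tag{29} \]

where we have used (28) to bound the probability of $\overline{A}_\alpha$ and Fact 6 to lower bound $\min_{i \in A^b_\alpha} Z(W_n^{(i)})$.

By Fact 6, $Z(W_n^{(i)}) \le 2\alpha^n$ for $i \in A^g_\alpha$. Together with (29), applied at level $n = m$, we can conclude that for all
$\alpha \in (3/4, 1)$ there is some constant $c_\alpha$ such that for all $\epsilon > 0$ and $m \ge c_\alpha \log(1/\epsilon)$,

\[ \Pr_i[Z(W_m^{(i)}) \le 2\alpha^m] \ge \Pr(A^g_\alpha) \ge 1 - Z(W) - \epsilon = I(W) - \epsilon. \qquad \square \]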
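For the erasure channel the whole polarization process can be simulated exactly, since each step maps $z$ to either $2z - z^2$ or $z^2$. The Python sketch below (helper names and parameter values are illustrative assumptions) enumerates all $2^n$ Bhattacharyya parameters, which allows the martingale property of $Z$ and the geometric decay of $\mathbb{E}_i\sqrt{Y(W_n^{(i)})}$ to be inspected numerically.

```python
import math

def bec_z_values(z0, n):
    # All 2^n values Z(W_n^(i)) for a BEC with Z(W) = z0:
    # each level maps z to 2z - z^2 (minus branch) and z^2 (plus branch).
    zs = [z0]
    for _ in range(n):
        zs = [w for z in zs for w in (2 * z - z * z, z * z)]
    return zs

def mean_sqrt_y(zs):
    # Empirical E_i[sqrt(Y)] with Y = Z (1 - Z); max(0, .) guards against rounding.
    return sum(math.sqrt(max(0.0, z * (1 - z))) for z in zs) / len(zs)

def fraction_good(zs, alpha, n):
    # Fraction of indices with Z(W_n^(i)) <= 2 alpha^n.
    return sum(z <= 2 * alpha ** n for z in zs) / len(zs)

if __name__ == "__main__":
    n, z0 = 16, 0.5
    zs = bec_z_values(z0, n)
    print(sum(zs) / len(zs), mean_sqrt_y(zs), fraction_good(zs, 0.8, n))
```

The enumeration is exponential in $n$, so this is only practical for small depths; it is meant as a check on the analysis, not as a construction method.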

C Analytic bound on geometric decay rate of $Y(W_n^{(i)})$

The bound on $f$ stated in Lemma 8 can be found by numerically maximizing $f$ in the range $[0, 1]$, but it is
difficult to verify that $f$ is indeed concave in the range $[0, 1]$ without resorting to more numerical approaches.
In this section, we offer an analytic justification that $f$ is bounded by a constant less than 1 on the interval
$[0, 1]$.

First, we state the following lemma to make future analysis easier.

Lemma 16. $\frac{(2 - z)(1 - z\sqrt{2 - z^2})}{1 - z} \le 2.1(1 - z)$ for all $z \in [0, 1)$.

With this in hand, we can easily prove Lemma 8.

Proof of Lemma 8. Using Lemma 16, we have (where we are newly defining $g(z)$):

\[ f(z) \le \frac{1}{2}\Bigl( \sqrt{z(1 + z)} + \sqrt{2.1(1 - z)} \Bigr) \triangleq \frac{1}{2} g(z). \]

Since $g(z)$ is continuous over $[0, 1]$ and $g(0) < 2$ and $g(1) < 2$, to prove the existence of some $\Lambda < 1$
such that $f(z) \le \Lambda$ for all $z \in [0, 1]$, it is sufficient to show that $g(z) \ne 2$ at any point in $[0, 1]$.
Therefore, we only need to consider the roots of the equation

\[ \sqrt{z(1 + z)} + \sqrt{2.1}\sqrt{1 - z} - 2 = 0. \]

Expanding the surds, we obtain the polynomial equation (which has roots wherever $g(z) - 2$ has roots)

\[ h(z) = 100z^4 + 620z^3 - 259z^2 - 422z + 361 = 0 \]

which has two complex roots and two real roots at approximately $-6.5$ and $-1$. Since $h(z)$ has roots wherever $g(z) - 2$
has roots and none occur in $[0, 1]$, we have proven the lemma. $\square$

We then prove Lemma 16, which is an observation that $\frac{(2 - z)(1 - z\sqrt{2 - z^2})}{1 - z}$ is very "close" to the linear
function $2(1 - z)$, and is indeed bounded above by $2.1(1 - z)$.

Proof of Lemma 16. We use the same technique that we used in the proof of Lemma 8. We define the
continuous function

\[ \tilde{f}(z) \triangleq (2 - z)\bigl(1 - z\sqrt{2 - z^2}\bigr) - 2.1(1 - z)^2. \]

After squaring and simplifying, the equation $\tilde{f}(z) = 0$ implies

\[ \tilde{g}(z) = 1 - 64z + 266z^2 - 544z^3 + 641z^4 - 400z^5 + 100z^6 = 0. \]

We can pull out a factor of $(z - 1)^2$ to obtain

\[ 1 - 62z + 141z^2 - 200z^3 + 100z^4 = 0 \]

which is a polynomial that has two complex roots, one root at approximately $z_0 = 0.017$, and one root at
approximately 1.269. As $\tilde{g}(z)$ has a zero whenever $\tilde{f}(z)$ does and $\tilde{f}(z)$ is continuous on $[0, 1]$, it is sufficient
to test $\tilde{f}(0) \le 0$, $\tilde{f}(z_0) < 0$ (to conclude that $z_0$ is not a root of $\tilde{f}$), and $\tilde{f}(1) \le 0$ to conclude the lemma. $\square$
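The algebra in the proof of Lemma 16 is easy to double-check mechanically. The Python sketch below multiplies out $(z - 1)^2$ times the quartic with exact integer arithmetic and evaluates $\tilde{f}$ on a grid; the coefficient lists are stored in ascending order of degree, and the names `poly_mul`, `f_tilde` and `lemma16_holds_on_grid` as well as the grid size are illustrative assumptions.

```python
import math

def poly_mul(a, b):
    # Product of two polynomials given as coefficient lists (ascending degree).
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# 1 - 64z + 266z^2 - 544z^3 + 641z^4 - 400z^5 + 100z^6
G_TILDE = [1, -64, 266, -544, 641, -400, 100]
# 1 - 62z + 141z^2 - 200z^3 + 100z^4
QUARTIC = [1, -62, 141, -200, 100]

def f_tilde(z):
    # (2 - z)(1 - z sqrt(2 - z^2)) - 2.1 (1 - z)^2
    return (2 - z) * (1 - z * math.sqrt(2 - z * z)) - 2.1 * (1 - z) ** 2

def lemma16_holds_on_grid(points=10001, tol=1e-12):
    # Floating-point check that f_tilde(z) <= 0 on an evenly spaced grid of [0, 1].
    return all(f_tilde(k / (points - 1)) <= tol for k in range(points))

if __name__ == "__main__":
    print(poly_mul([1, -2, 1], QUARTIC) == G_TILDE, lemma16_holds_on_grid())
```

The factorization check is exact; the grid check is only numerical evidence and complements, rather than replaces, the root-counting argument.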

D Output Symbol Binning

Proof of Proposition 13. First, it is clear that the algorithm runs in time polynomial in $|\mathcal{Y}|$ and $k$; $k$ bits of
precision is more than sufficient for all of the arithmetic operations, and the operations are done for each
symbol in $\mathcal{Y}$.

For $\tilde{y} \in \widetilde{\mathcal{Y}}$, let $I_{\tilde{y}}$ be the set of $y$ associated with the symbol $\tilde{y}$; that is, all $y$ such that $p(0|y)$ falls in the
interval of $[0, 1]$ associated with $\tilde{y}$ (which is $[j/k, (j + 1)/k)$ for $\tilde{y} = \tilde{y}_j$).

For the lower bound, it is clear that $H(\widetilde{W}) \ge H(W)$. Juxtaposing the definitions of $H(W)$ and $H(\widetilde{W})$
together, we obtain (defining the binary entropy function $h(x) = -x \lg x - (1 - x) \lg(1 - x)$):

\[ H(W) = \sum_{y \in \mathcal{Y}} p(y)\, h(p(0|y)) \le \sum_{\tilde{y} \in \widetilde{\mathcal{Y}}} \Bigl( \sum_{y \in I_{\tilde{y}}} p(y) \Bigr) h(p(0|\tilde{y})) = H(\widetilde{W}) \]

where the inequality is due to the concavity of $h(x)$.

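Using the fact $\min_i a_i/b_i \le \sum_i a_i / \sum_i b_i \le \max_i a_i/b_i$, we can bound

\[ p(0|\tilde{y}) = \frac{\sum_{y \in I_{\tilde{y}}} W(y|0)}{2 \sum_{y \in I_{\tilde{y}}} p(y)} \]

with the expressions

\[ \min_{y \in I_{\tilde{y}}} p(0|y) \le p(0|\tilde{y}) \le \max_{y \in I_{\tilde{y}}} p(0|y) \]

which implies, for all $y \in I_{\tilde{y}}$,

\[ p(0|y) - \frac{1}{k} \le p(0|\tilde{y}) \le p(0|y) + \frac{1}{k}. \]

We will need to offer a bound on $h(p(0|\tilde{y}))$ as a function of $h(p(0|y))$. $h(x)$ is concave and obeys
$|h'(x)| \le \lg k$ if $1/k < x < 1 - 1/k$. Define the "middle set" $\widetilde{\mathcal{Y}}_m = \{\tilde{y}_i : 0 < i < k - 1\}$, corresponding
with intervals where $p(0|y)$ is in the range $1/k < x < 1 - 1/k$. Then, by the concavity of $h(x)$, for all
$\tilde{y} \in \widetilde{\mathcal{Y}}_m$ and $y \in I_{\tilde{y}}$, we have $h(p(0|\tilde{y})) \le h(p(0|y)) + 2\lg(k)/k$.

We now provide a bound for the remaining symbols $\tilde{y}_0$, $\tilde{y}_{k-1}$ and $\tilde{y}_k$. $\tilde{y}_k$ is trivial because it represents
all symbols where $p(0|y) = 1$, and merging those symbols together still results in $p(0|\tilde{y}) = 1$. For $\tilde{y}_0$, we
have

\[ h(p(0|\tilde{y})) \le h(1/k) \le 2\lg(k)/k \le h(p(0|y)) + 2\lg(k)/k \]

and similarly for $\tilde{y}_{k-1}$.

With these expressions in hand, we can now write

\begin{align*}
H(\widetilde{W}) &= \sum_{\tilde{y} \in \widetilde{\mathcal{Y}}} \sum_{y \in I_{\tilde{y}}} p(y)\, h(p(0|\tilde{y})) \\
&\le \sum_{\tilde{y} \in \widetilde{\mathcal{Y}}} \sum_{y \in I_{\tilde{y}}} p(y) \bigl( h(p(0|y)) + 2\lg(k)/k \bigr) \\
&= H(W) + 2\lg(k)/k. \qquad \square
\end{align*}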